SYSTEM AND PROCESS FOR VALIDATING,
ALIGNING AND REORDERING ONE OR MORE GENETIC SEQUENCE 
MAPS USING AT LEAST ONE ORDERED RESTRICTION MAP 

Field of the Invention

The present invention relates to a system and process for a sequence validation 
based on at least one ordered restriction map, and more particularly to validating, 
aligning and/or reordering one or more genetic sequence maps (e.g., ordered 
restriction enzyme DNA maps) using such ordered restriction map via map matching 
and comparison.

Background Information 

The sequence of nucleotide bases present in strands of nucleotides, such as 
DNA and RNA, carries the genetic information encoding proteins and RNAs. The 
ability to accurately determine a nucleotide sequence is crucial to many areas in
molecular biology. For example, the study of genetics relies on complete nucleotide 
sequences of the organism. Many efforts have been made to generate complete 
nucleotide sequences for various organisms, including humans, mice, worms, flies 
and microbes. 

There are a variety of well-known methods to sequence nucleotides, including
the Sanger dideoxy chain termination sequencing technique and the Maxam-Gilbert
chemical sequencing technique. However, the current technology limits the length of
a nucleotide sequence that may be sequenced. Techniques have been developed to
sequence larger nucleotide sequences. In general, these methods involve fragmenting
the large sequence into fragments, cloning the fragments, and sequencing the cloned
fragments. The sequences can be fragmented through the use of restriction enzymes
or mechanical shearing. Cloning techniques include the use of cloning vectors such
as cosmids, bacteriophage, and yeast or bacterial artificial chromosomes (YAC or
BAC). The nucleotide sequence of the fragments can then be compared, overlapping
regions identified, and the sequences assembled to form "contigs," which are sets of
overlapping clones. By assembling the overlapping clones, it is possible to determine
the sequence of nucleotide bases of the full length sequence. These methods are well
known to those having ordinary skill in the art.

The accuracy of nucleotide sequence data is limited by numerous factors. For
example, there may be missing sections due to incomplete representation of the
genomic DNA. There may also be spurious DNA sequences intermixed with the
desired genomic DNA. Common sources of contamination are vector-derived DNA
and host cell DNA. Also, the accuracy of the identification of bases tends to degrade
toward the end of long sequence reads. Additionally, repeated sequences can create
errors in the re-assembly and/or the mismatching of contigs.

In order to reduce the sequence data errors, sequencing of the fragments is
generally performed multiple times. To help reduce errors such as mismatching or
misassembly resulting from repeated sequences, the "hierarchical shotgun
sequencing" approach (also referred to as "map-based," "BAC-based" or "clone by
clone") can be used. This approach involves generating and organizing a set of large
insert clones covering the genome and separately performing shotgun sequencing on
appropriately selected clones. Because the sequence information is local, the issue of
long-range misassembly is eliminated and the risk of short-range misassembly is
reduced.

Other known sequencing and characterization techniques involve generating
restriction fragment fingerprints to determine whether close overlaps are present,
thereby assembling the BACs into fingerprint clone contigs. Fingerprint clone contigs
can be positioned along the chromosome by anchoring them with sequence-tagged
site (STS) markers from existing genetic and physical maps. These fingerprint clone
contigs can be associated with specific STSs by probe hybridization or direct search
of the sequenced clones. Clones can also be positioned by fluorescence in situ
hybridization. Each of these known techniques is costly and time consuming.

Another approach for characterizing nucleotide sequences involves the use of
ordered restriction maps of single molecules. One specific technique used to produce
single molecule ordered restriction maps is "Optical Mapping". Optical mapping is a
single molecule methodology for the rapid production of ordered restriction maps
from individual DNA molecules. Ordered restriction maps are preferably constructed
using fluorescence microscopy to visualize restriction endonuclease cutting events on
individual fluorochrome-stained DNA molecules. Restriction enzyme cleavage sites
are visible as gaps that appear flanking the relaxed DNA fragments (pieces of
molecules between two consecutive cleavages). Relative fluorescence intensity
(measuring the amount of fluorochrome binding to the restriction fragment) or
apparent length measurements (along a well-defined "backbone" spanning the
restriction fragment) have proven to provide accurate size estimates of the restriction
fragment and have been used to construct the final restriction map.

Such a restriction map created from one individual DNA molecule is limited in
its accuracy by the resolution of the microscopy, the imaging system (CCD camera,
quantization level, etc.), illumination and surface conditions. Furthermore, depending
on the digestion rate and the noise inherent to the intensity distribution along the DNA
molecule, with some probability, one is likely to miss a small fraction of the
restriction sites or introduce spurious sites. Additionally, investigators may
sometimes (rather infrequently) lack the exact orientation information (whether the
left-most restriction site is the first or the last). Thus, given two arbitrary single
molecule restriction maps for the same DNA clone obtained this way, the maps are
expected to be roughly the same in the following sense: if the maps are "aligned" by
first choosing the orientation and then identifying the restriction sites that differ by a
small amount, then most of the restriction sites will appear roughly at the same place
in both maps.

For instance, in the original method, fluorescently-labeled DNA molecules
were elongated in a flow of molten agarose containing restriction endonucleases,
generated between a cover-slip and a microscope slide, and the resulting cleavage
events were recorded by fluorescence microscopy as time-lapse digitized images. The
second generation optical mapping approach, which dispensed with agarose and
time-lapsed imaging, involves fixing elongated DNA molecules onto positively-charged
glass surfaces, thus improving sizing precision as well as throughput for a wide range
of cloning vectors (cosmid, bacteriophage, and yeast or bacterial artificial
chromosomes (YAC or BAC)).

A DNA sequence map is an "in silico" ordered restriction map that is obtained
for a nucleotide sequence by simulating a restriction enzyme digestion process. The
sequence data is analyzed and restriction sites are identified in a predetermined
manner. The resulting sequence map has some piece of identification data plus a
vector of fragments, whose elements encode the fragment size in base-pairs.
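
The simulated digestion described above can be illustrated with a minimal sketch. The function name, the example recognition site (GAATTC, the EcoRI site) and the `cut_offset` parameter are assumptions introduced for illustration only; they are not taken from the present disclosure, which identifies restriction sites "in a predetermined manner" without specifying one.

```python
def in_silico_digest(sequence, site, cut_offset=0):
    """Return the ordered fragment sizes (in base-pairs) obtained by cutting
    `sequence` at every occurrence of the recognition `site`.

    cut_offset is the position inside the site where the cut is assumed to
    fall (an illustrative parameter, not part of the disclosure).
    """
    seq = sequence.upper()
    site = site.upper()

    # Collect the cut position for every occurrence of the recognition site.
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)

    # Fragment sizes are the distances between consecutive boundaries.
    boundaries = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Hypothetical 17-base sequence containing two GAATTC sites.
fragments = in_silico_digest("AAGAATTCCCGAATTCT", "GAATTC", cut_offset=1)
```

The returned list plays the role of the vector of fragments; identification data would be stored alongside it.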

Sequenced clones can be associated with fingerprint clone contigs in the
physical map by using the sequence data to calculate a partial list of restriction
fragments in silico and comparing that list with the experimental database of BAC
fingerprints. Genomic consensus maps are generated from optical maps using, e.g.,
"Gentig" software, which is conventional software that generates optical ordered
restriction maps.

It was previously unknown how to determine the accuracy of the DNA
sequence maps. Indeed, such determination was either impossible or provided a small
level of surety. It is one of the objects of the present invention to enable a validation
of the DNA ordered sequence maps against the optical maps. Another object of the
present invention is to enable an alignment and reordering of the DNA sequence maps
based on the optical mapping.

Approaches to aligning or reconstructing restriction maps have been described
in E.W. Myers et al., "An O(N^2 lg N) Restriction Map Comparison and Search
Algorithm", Bulletin of Mathematical Biology, 54(4):599-618, 1992; R.M. Karp et
al., "Algorithms for Optical Mapping", RECOMB 98, 1998; Parida, L., "A Uniform
Framework for Ordered Restriction Map Problems", Journal of Computational
Biology, Vol 5, No 4, Mary Ann Liebert Inc. Publishers, pp 725-739, 1998; Gusfield,
D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997;
and Lee, J.K., Dancik, V., and M.S. Waterman, "Estimation for restriction sites
observed by optical mapping using reversible-jump Markov Chain Monte Carlo", J.
Comp. Biol., 5, 505-516, 1997. However, none of these publications discloses the
novel processes and systems described herein below.

Summary of the Invention

In general, an exemplary embodiment of the system and process for validating
and aligning the simulated ordered restriction map against the optical ordered
restriction map according to the present invention can be implemented as follows.
First, each molecule may be cut in several places using a digestion process by one or
more restriction enzymes as is known to those having ordinary skill in the art. Each of
these "cut" molecules can represent a partial DNA (optical) ordered restriction map.
Then, it is possible to reconstruct a complete Genome Wide (optical) ordered
restriction map. Such reconstruction process can be carried out by an iterative process
which maximizes the likelihood of a plausible hypothesis given the partial map and
the model of the error sources (e.g., a Bayesian-based process).

It should be understood that the inputs to the Validation/Alignment system and
process are preferably restriction maps (which include DNA sequences therein) and
Genome wide (e.g., optical) ordered restriction maps (which can be represented as
variable length vectors of segment/fragment information fields). Each segment
information field has two pieces of information associated therewith: size and standard
deviation. The size may be a measure of the segment, which is proportional to the
number of nucleotides present in the segment. The standard deviation preferably
represents the error associated with the segment size measurement. Each map has
associated therewith, e.g., two measures of how reliable the detection of cuts by the
procedure is, i.e., the false positive probability and the digestion probability. The first
measure relates to the event that the cut is detected incorrectly. The second measure
relates to the event that the cut actually appears where it is reported.
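
One possible in-memory representation of such inputs is sketched below. The class and field names (`Segment`, `RestrictionMap`, `map_id`, and so on) are hypothetical and chosen only for illustration; the disclosure specifies the information carried by each map but not a concrete data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    size: float     # measure proportional to the number of nucleotides
    std_dev: float  # error associated with the size measurement


@dataclass
class RestrictionMap:
    map_id: str                  # identification data (assumed to be a string)
    segments: List[Segment]      # variable-length vector of segments
    false_positive_prob: float   # probability that a cut is detected incorrectly
    digestion_prob: float        # probability that a reported cut is really present

    def total_size(self) -> float:
        """Sum of all segment sizes."""
        return sum(s.size for s in self.segments)

    def reversed(self) -> "RestrictionMap":
        """Same map read in the opposite orientation."""
        return RestrictionMap(self.map_id, self.segments[::-1],
                              self.false_positive_prob, self.digestion_prob)


# Hypothetical two-segment optical map.
optical = RestrictionMap("opt1", [Segment(10.0, 1.0), Segment(20.0, 2.0)], 0.05, 0.9)
```

The `reversed` helper is included because the orientation of a map relative to another may not be known in advance.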

According to the present invention, the optical and simulated ordered
restriction maps are compared to one another to determine whether and to what extent
they match. The accuracy of a match is computed by minimizing the error committed
by matching one map against the other at a given position. An exemplary
mathematical model and procedure underlying this computation is preferably a
Bayesian-based procedure/algorithm. The computation is based on a Dynamic
Programming Procedure ("DPP"). However, it should be understood that other
procedures and algorithms can be utilized to compare these maps to one another to
validate and align at least one such map, according to the present invention.

Using the Bayesian-based exemplary procedure with the system and method
of the present invention, a hypothesis can be obtained and the probability of a given
event (based on the hypothesis) may be formulated. This probability is preferably a
mathematical formula, which is then computed using a conventional model of various
error sources. An exemplary optimization process which uses such formula may
maximize or minimize the formula.

In order to find the extreme value of the overall probability formula over all
possible combinations of matches, a conventional DPP can be used on the problem
which was defined by the Bayesian-based exemplary procedure as described above.
For example, the DPP may preferably compute a set of extreme values for a
mathematical formula defined above by extending a partial solution in a
predetermined manner while keeping track of a particular number of alternatives. All
of the alternatives may be maintained in a table, and thus do not have to be
recomputed every time the associated likelihood or score function needs to be
evaluated.
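
The table-based extension of partial solutions can be sketched as follows. This is a simplified least-squares stand-in, not the Bayesian score of the present disclosure: the cost function, the `max_merge` limit and the `miss_penalty` value are all assumptions made for illustration. Each table entry holds the best score for aligning a prefix of the sequence map with a prefix of the optical map, so earlier results are looked up rather than recomputed.

```python
import math


def align_score(seq_sizes, opt_sizes, opt_sds, max_merge=3, miss_penalty=1.0):
    """Global alignment score between two ordered fragment-size vectors.

    A block of up to `max_merge` consecutive fragments on either map may be
    matched as one unit, which models missing or false cuts. The cost of a
    block is a squared size difference scaled by the summed variance, plus an
    assumed fixed penalty per merged cut. opt_sds must be positive.
    """
    n, m = len(seq_sizes), len(opt_sizes)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0  # empty prefixes align at no cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for u in range(1, min(max_merge, i) + 1):
                a = sum(seq_sizes[i - u:i])
                for v in range(1, min(max_merge, j) + 1):
                    prev = table[i - u][j - v]  # reuse stored partial solution
                    if prev == math.inf:
                        continue
                    d = sum(opt_sizes[j - v:j])
                    var = sum(s * s for s in opt_sds[j - v:j])
                    cost = (a - d) ** 2 / var + miss_penalty * ((u - 1) + (v - 1))
                    table[i][j] = min(table[i][j], prev + cost)
    return table[n][m]


# Identical maps score zero; one merged cut costs one penalty unit.
identical = align_score([10, 20, 30], [10, 20, 30], [1, 1, 1])
one_missing = align_score([30, 30], [10, 20, 30], [1, 1, 1])
```

A lower score indicates a better match in this sketch; a likelihood-based formulation would instead be maximized.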

Accordingly, a method and system according to the present invention are
provided for comparing ordered segments of a first DNA map with ordered segments
of a second DNA map to determine a level of accuracy of the first DNA map and/or the
second DNA map. In particular, the first and second DNA maps can be received (the
first DNA map corresponding to a sequence DNA map, and the second DNA map
corresponding to a genomic consensus DNA map as provided in an optical DNA
map). Then, the accuracy of the first DNA map and/or the second DNA map is
validated based on information associated with the first and second DNA maps.

In another embodiment of the present invention, the first DNA map and/or the
second DNA map are validated by determining whether one or more matches exist
between ordered segments of the first DNA map and the ordered segments of the
second DNA map. In addition, a number of the matches which exist between the
segments of the first DNA map and the segments of the second DNA map can be
obtained.

In yet another embodiment of the present invention, the first DNA map and/or
the second DNA map are validated by determining whether the first DNA map
includes one or more cuts which are missing from the second DNA map. Also, a
number and locations of the missing cuts based on the first and second DNA maps
can be obtained thereafter.

According to a further embodiment of the present invention, the first DNA
map and/or the second DNA map are validated by determining whether the second
DNA map includes one or more cuts which are absent from the first DNA map. The
validation can also be performed by determining whether the first DNA map includes
one or more cuts which are missing from the second DNA map, obtaining a first
number and locations of the missing cuts based on the first and second DNA maps,
determining whether the second DNA map includes one or more cuts which are absent
from the first DNA map, and obtaining a second number and locations of the absent cuts
based on the first and second DNA maps. Furthermore, it is possible to generate an
error indication if the number of the matches is less than a match threshold, the first
number of the missing cuts is greater than a first predetermined threshold, and/or the
second number of the absent cuts is greater than a second predetermined threshold.
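
The threshold test for generating an error indication can be sketched as below. The function name, argument names and message strings are hypothetical; the disclosure specifies the three comparisons but not how the indication is reported.

```python
def error_indications(num_matches, num_missing, num_absent,
                      match_threshold, missing_threshold, absent_threshold):
    """Return a list of error messages; an empty list means no error is raised.

    Mirrors the three conditions in the text: too few matches, too many cuts
    missing from the second map, or too many cuts absent from the first map.
    """
    errors = []
    if num_matches < match_threshold:
        errors.append("too few matches")
    if num_missing > missing_threshold:
        errors.append("too many missing cuts")
    if num_absent > absent_threshold:
        errors.append("too many absent cuts")
    return errors


# Hypothetical counts and thresholds: only the missing-cut test fails here.
result = error_indications(num_matches=12, num_missing=4, num_absent=1,
                           match_threshold=10, missing_threshold=3,
                           absent_threshold=2)
```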

In another embodiment of the present invention, the first DNA map is an in
silico ordered restriction map obtained from a DNA sequence, which may include
identification data and at least one vector of the segments of the first DNA map. At
least one vector of the first segments can encode a size in base-pairs of the DNA
sequence. Further, the second DNA map can include identification data and at least
one variable-length vector representing its ordered segments.

In still another embodiment of the present invention, the second DNA map is
defined as a subsequence of a genome-wide ordered restriction map. Also, the
validation is performed by determining the accuracy of at least one of the first DNA
map and the second DNA map using the following probability density function:
where D is the second DNA map, A is the first DNA map, σ is a standard deviation
summarizing map-wide standard deviation data, p_c is a probability of a positive cut of
a DNA sequence, and p_f is a probability of a false-positive cut of the DNA sequence.

In another embodiment of the present invention, the accuracy can be validated
as a function of an orientation of the first DNA map with respect to an orientation of
the second DNA map. Also, the validation can be performed by executing a dynamic
programming procedure ("DPP") on the first and second DNA maps to generate a first
table of partial and complete alignment scores, and first auxiliary tables and first data
structures to keep track of number and locations of cuts and segment matches,
receiving a third DNA map which is a reverse map of the first DNA map, executing
the DPP on the second and third DNA maps to generate a second table of partial and
complete alignment scores, and second auxiliary tables and second data structures to
keep track of number and locations of the cuts and the segment matches, analyzing a
last row of the first table and a last row of the second table to obtain at least one
optimum alignment of the first and second DNA maps, and reconstructing an
optimum alignment and/or sub-optimal alignments using the first and second auxiliary
tables and data structures.
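
The comparison of the forward and reversed orientations can be sketched in a few lines. The helper `squared_size_error` is a deliberately trivial stand-in for the DPP score and is an assumption made purely so that the example is runnable; the auxiliary tables and the last-row analysis described above are not reproduced here.

```python
def squared_size_error(first_map, second_map):
    """Toy score: sum of squared size differences for equal-length maps.
    Stands in for the DPP alignment score (lower is better)."""
    if len(first_map) != len(second_map):
        return float("inf")
    return sum((x - y) ** 2 for x, y in zip(first_map, second_map))


def best_orientation(first_map, second_map, score):
    """Score the first map in both orientations against the second map and
    return the better orientation together with its score."""
    forward = score(first_map, second_map)
    reverse = score(first_map[::-1], second_map)  # the "third" (reverse) map
    if forward <= reverse:
        return ("forward", forward)
    return ("reverse", reverse)


# The sequence map below matches the optical map only when reversed.
orientation = best_orientation([30, 20, 10], [10, 20, 30], squared_size_error)
```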

According to still another embodiment of the present invention, the accuracy
can be validated by matching an extension of one or more left end segments of the
segments of the first DNA map to at least one segment of the second DNA map
and/or by matching an extension of one or more right end segments of the segments of
the first DNA map to at least one segment of the second DNA map. Furthermore, it is
possible to detect an alignment of the first DNA map with respect to the second DNA
map, the alignment being indicative of sequence positions of the segments of the first
DNA map along the second DNA map.

In addition, other embodiments of the process and system according to the
present invention are provided for aligning a plurality of DNA sequences with a DNA
map. First, the DNA sequences and the DNA map can be received (the DNA
sequences being fragments of a genome and the DNA map corresponding to a
genomic consensus DNA map which relates to an ordered restriction - e.g. optical -
DNA map). Then, a level of accuracy of the DNA sequences and the DNA map is
validated based on information associated with the DNA sequences and the DNA
map. The locations of the DNA map at which the DNA sequences are capable of
being associated with particular segments of the DNA map are located. Furthermore,
it is possible to obtain locations of the DNA map (without the validation) by locating
an optimal one of the locations for each of the DNA sequences for each of the
locations.

In another embodiment of the present invention, the locations are determined
for each of the DNA sequences. They may be positions on the DNA map at which the
corresponding DNA sequences are anchorable, and these locations can define at least
one alignment of the DNA sequences with respect to the DNA map. The alignment
may include multiple alignments of the DNA sequences with respect to the DNA
map, and the multiple alignments may be ranked based on predetermined criteria to
obtain a score set which includes a particular score for each of the multiple
alignments. The determination may be performed by providing the DNA sequences in
a first order of the multiple alignments with respect to the DNA map and determining
a position for each of the DNA sequences, with respect to the DNA map, by selecting
the DNA sequences to be in a second order corresponding to the score set.

In still another embodiment of the present invention, the determination of the
locations can be performed by restricting each of the DNA sequences to be associated
with only one of the locations on the DNA map. Also, such determination may
produce a single alignment of the DNA sequences with respect to the DNA map.

In yet another embodiment of the present invention, the determination can be
performed by locating an optimal one of the locations for each of the DNA sequences
to obtain an alignment solution for each of the locations. Also, the locating of the
optimal location may be repeated for each subsequent one of the locations while
excluding the alignment solution from a preceding locating procedure. Furthermore,
each subsequent locating procedure can be made by relaxing at least one particular
constraint to determine the respective locations. The particular constraint preferably
includes a first requirement that two of the DNA sequences are prevented from
overlapping when associated with the respective locations on the DNA map. The
particular constraint can include a second requirement that a maximum number of the
DNA sequences are associated with the respective locations on the DNA map, and a
third requirement that an overall score of the alignment of the DNA sequences with
respect to the locations on the DNA map is minimized or maximized. It is also
possible to assign respective weights to the second requirement and the third
requirement.
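
The interplay of the non-overlap requirement with the score requirement can be illustrated with a greedy sketch. This is a heuristic chosen for brevity and is not asserted to be the procedure of the present disclosure; the function name, the candidate tuple layout `(start, end, score)` and the convention that a lower score is better are all assumptions.

```python
def greedy_placement(candidates):
    """Assign each sequence to at most one non-overlapping location.

    candidates: dict mapping a sequence name to a list of
                (start, end, score) tuples, lower score meaning better.
    Returns a dict mapping each placed sequence name to its chosen tuple.
    """
    # Consider all candidate locations from best score to worst.
    ranked = sorted((score, start, end, name)
                    for name, places in candidates.items()
                    for start, end, score in places)
    chosen = {}
    for score, start, end, name in ranked:
        if name in chosen:
            continue  # each sequence is associated with only one location
        # First requirement: no overlap with any already placed sequence.
        if all(end <= s or start >= e for s, e, _ in chosen.values()):
            chosen[name] = (start, end, score)
    return chosen


# Hypothetical candidates: the best location of "c1" collides with "c2".
placement = greedy_placement({
    "c1": [(0, 5, 0.2), (10, 15, 0.9)],
    "c2": [(3, 8, 0.1), (20, 25, 0.5)],
})
```

Because the heuristic commits to the best-scoring location first, it may place fewer sequences than an exhaustive search; weighting the second and third requirements differently would change that trade-off.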

Brief Description of the Drawings 

For a more complete understanding of the present invention and its
advantages, reference is now made to the following description, taken in conjunction
with the accompanying drawings, in which:

Figure 1 is a first exemplary embodiment of a system for validating, aligning 
and/or reordering a genetic sequence using an optical map via map matching and 
comparison according to the present invention;

Figure 2 is a second exemplary embodiment of a system for validating, 
aligning and/or reordering a genetic sequence using the optical map;

Figure 3 is an exemplary embodiment of a validation procedure of a process 
according to the present invention; 

Figure 4 is an exemplary embodiment of the process according to the present 
invention for simulating a restriction digestion of the sequence map, and then 
validating the accuracy of the consensus optical ordered restriction map and/or the
simulated map; 

Figure 5A is a detailed flow chart of an exemplary validation technique
utilized in the process shown in Figure 4; 

Figure 5B is a detailed illustration of an exemplary flow diagram of particular
steps of Figure 5A in which fragments of the optical ordered restriction map are
compared to fragments of the simulated ordered restriction map to obtain one or more
set(s) of most likely matches;

Figure 6A is a first exemplary illustration of a technique for matching a
sequence map against a consensus optical map;

Figure 6B is a second exemplary illustration of the technique for matching the
sequence map against the consensus optical map in which the consensus optical map
does not possess any false enzyme cuts and the sequence map does not have any
missing enzyme cut(s);

Figure 6C is a third exemplary illustration of the technique for matching the
sequence map against the consensus optical map in which the consensus optical map
does not possess any false enzyme cuts while the sequence map is missing the enzyme
cut(s);

Figure 6D is a fourth exemplary illustration of the technique for matching the
sequence map against the consensus optical map in which the consensus optical map
has a false enzyme cut and the sequence map does not have any missing enzyme cuts;

Figure 6E is a fifth exemplary illustration of the technique for matching the
sequence map against the consensus optical map in which the consensus optical map
has a false enzyme cut and the sequence map is missing the enzyme cut;

Figure 6F is a sixth exemplary illustration of the technique for matching the
sequence map against the consensus optical map in which left fragments of each of
the consensus optical and sequence maps are mismatched;

Figure 6G is a seventh exemplary illustration of the technique for matching the
sequence map against the consensus optical map in which right fragments of each of 
the consensus optical and sequence maps are mismatched; 

Figure 7 is a detailed illustration of the exemplary flow diagram of the
validation procedure according to the present invention which utilizes dynamic
programming principles and the sequence and consensus maps illustrated in Figures
6F and 6G;

Figure 8 is an exemplary embodiment of the process according to the present
invention in which an alignment of a simulated ordered restriction map takes place after
(or during) the validation technique has been implemented to determine the accuracy
of the simulated ordered restriction map(s) and/or the consensus optical map(s);

Figure 9 is a detailed illustration of the flow diagram of the process shown in
Figure 8; 

Figure 10 is a flow diagram of a particular set of steps in the process
illustrated in Figure 9 in which best matches are selected for each sequence map and
an overall alignment thereof is constructed; and

Figure 11 is an illustration of an example of a possible alignment of a
chromosome arrangement using the system and process of the present invention.

Detailed Description 

Figure 1 illustrates a first exemplary embodiment of a system for validating,
aligning and/or reordering a genetic sequence using an optical (consensus) map via
map matching and comparison according to the present invention. In this
embodiment, the system includes a processing device 10 which is connected to a
communications network 100 (e.g., the Internet) so that it can receive optical
sequence mapping data and DNA sequence data. The processing device 10 can be a
mini-computer (e.g., Hewlett Packard mini computer), a personal computer (e.g., a
Pentium chip-based computer), a mainframe (e.g., IBM 3090 system), and the like.
The DNA sequence data can be provided from a number of sources. For example, this
data can be GenBank Data 110 obtained from the GenBank database (NIH genetic
sequence database), Sanger Data 120 obtained from the Sanger Center database, and/or
Celera Data 130 obtained from the Celera Genomics database. These are publicly
available genetic databases, or - in the last case - private commercial genetic
databases. The optical sequence mapping data correspond to optical mapping data 140
that can be obtained from external systems. For example, such optical map data, i.e.,
optical mapping ordered restriction data, can be generated using the methods
described in U.S. Patent No. 6,174,671, the entire disclosure of which is incorporated
herein by reference. In particular, the methods described in this U.S. patent produce
high-resolution, high accuracy ordered restriction maps based on data created from
images of populations of individual DNA molecules digested by restriction enzymes.

As shown in Figure 1, after the processing device 10 receives the optical
mapping data and the DNA sequence data via the communications network 100, it can
then generate one or more results 20 which can be a validation/determination of the
accuracy of the DNA sequence data and/or of the optical mapping data, an alignment
of the DNA sequence data based on the results of the validation procedure, and
reordering thereof.

Figure 2 illustrates another embodiment of the system 10 according
to the present invention in which the optical mapping data 140 is transmitted to the
system 10 directly from an external source, without the use of the communications
network 100 for such transfer of the data. In this second embodiment of the system as
shown in Figure 2, the DNA sequence data 110, 120, 130 is also transmitted directly
from the one or more of the DNA sequence databases (e.g., the Sanger Center
database, the Celera Genomics database and/or the GenBank database), without the
need to use the communications network 100 shown in the first embodiment of Figure
1. It is also possible for the optical mapping data 140 to be obtained from a storage
device provided in or connected to the processing device 10. Such storage device can
be a hard drive, a CD-ROM, etc. which are known to those having ordinary skill in
the art.

A. VALIDATION PROCESS AND SYSTEM 
General Flow Diagram 

Figure 3 is an exemplary embodiment of the process according to the present
invention which is preferably executed by the processing device 10 of Figures 1 and
2. In this exemplary embodiment, the optical mapping data 140 is forwarded to a
technique 250 which constructs one or more consensus maps 260, based on this data
140 by considering the local variations among aligned single molecule maps. One
example of such technique 250 is a "gentig" computer program as described in T.
Anantharaman et al., "Genomics via Optical Mapping II: Ordered Restriction Maps",
Journal of Computational Biology, 4(2), 1997, pp. 91-118, and T. Anantharaman et
al., "Genomics via Optical Mapping III: Contiging Genomic DNA and Variations",
AAAI Press, 7th International Conference on Intelligent Systems for Molecular
Biology, ISMB 99, Vol. 7, 1999, pp. 18-27, the entire disclosures of which are
incorporated herein by reference. In particular, "gentig" software uses a
Bayesian-based (probabilistic) approach to automatically generate "contigs" from optical
mapping data. For example, "contigs" can be assembled over whole microbial
genomes. The "gentig" software repeatedly combines two islands that produce the
greatest increase in probability density, excluding any "contigs" whose false positive
overlap probability is unacceptable. For example, four parameters in the program
can be altered to change the number of molecules that the program "contigs" together,
thus forming the consensus maps. The details of the consensus maps shall be
described herein below in further detail.

According to the exemplary embodiment of the present invention, the DNA
sequence data (e.g., the GenBank data 110, the Sanger data 120 and the Celera data
130) can be collected at a database collection function 200, which can be a computer
program executed by the processing device 10. This collection can be initiated and/or
controlled either manually (e.g., by a user of the processing device 10 to obtain
particular DNA sequences) and/or automatically using the processing device 10 or
another external device. Upon the collection of the DNA sequence data from one or
more of the DNA sequence databases 110, 120, 130, the database collection function
200 outputs a particular DNA sequence 210 or a portion of such DNA sequence.
Thereafter, the data for this DNA sequence 210 (or a portion thereof) is forwarded to
a technique 220 which simulates a restriction enzyme digestion process to generate an
"in silico" ordered restriction sequence map 230.

Thereafter, the system and process of the present invention executes a
validation algorithm 270 which determines the accuracy of the ordered restriction
sequence map 230 based on the data provided in the optical consensus map(s) 260.
This result can be output as one or more results 280 in the form of a score (e.g.,
a rank for each ordered restriction map), a binary output (e.g., the accuracy validated
vs. invalidated), etc.

Provided herein below is a detailed information regarding the consensus maps 

and the sequence maps. 

Consensus (Optical) Map

The consensus optical map can be defined as a genome-wide, ordered
restriction map which is represented as a structured item consisting of particular
identification data and a variable length vector composed of fragments. For example,
the consensus map can be represented by a vector of fragments, where each fragment
is a triple of positive real numbers

$$(c_i, l_i, \sigma_i) \in \mathbb{R}^3$$



WO 02/26934 PCT/US01/30426 


and where $c_i$ is defined as the cut probability associated with a Bernoulli trial, $l_i$ is the
fragment size, related to the mean of a random variable with Gaussian distribution
having an estimated standard deviation equal to $\sigma_i$. For example, the total length of
the fragment vector can be defined as N. Also, it is possible to index the
vector of fragments from 0 to N-1.

The consensus maps can be created from several long genomic single 
molecule maps, where each molecule map thereof may be obtained from the images 
of the molecules stretched on a surface and further combined by a Bayesian algorithm 
implemented in the "gentig" program. As described above, the "gentig" program is 
capable of constructing consensus maps by considering local variations among the
aligned single molecule maps. 

Sequence Map 

As is generally known, a sequence is a string of letters obtained from a set {A,
C, G, T, N, X}. These letters have a standard meaning in the art of bioinformatics. In
particular, the letters A, C, G, T are DNA bases, N is "unknown", and X is a "gap".

A sequence map is an "in silico" ordered restriction map obtained from the
sequence by simulating a restriction enzyme digestion process. Hence, each sequence
map has some piece of identification data plus the vector of fragments, whose
elements encode exactly the size in base-pairs. The j-th element of the sequence map
fragment vector is defined as a number $a_j$, which is the size of the fragment. The total
length of the sequence map fragment vector is defined as M. The fragment vector is
indexed from 0 to M-1.
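
As an illustration of how such a fragment vector may be derived, the following is a minimal sketch in Python of an "in silico" digestion. It assumes, purely for illustration, a single enzyme described by its recognition site and cut offset (the defaults correspond to BamHI, which recognizes GGATCC and cuts after the first G), a linear sequence, and that the symbols N and X simply never match the site. The function name and its parameters are hypothetical and are not part of the described system.

```python
def in_silico_digest(sequence, site="GGATCC", cut_offset=1):
    """Return the ordered fragment sizes (in base pairs) of a linear sequence.

    sequence   -- string over {A, C, G, T, N, X}
    site       -- recognition site of the (assumed) restriction enzyme
    cut_offset -- position inside the site at which the strand is cut
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        # Continue one base further so that overlapping sites are also found.
        pos = seq.find(site, pos + 1)
    # Fragment boundaries: start of sequence, every cut, end of sequence.
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# A 21-base toy sequence containing two BamHI sites.
fragments = in_silico_digest("AAAGGATCCTTTTGGATCCAA")
print(fragments)
```

The returned list plays the role of the fragment vector indexed from 0 to M-1; the fragment sizes always add up to the length of the input sequence.
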

Thus, each sequence map has at least a portion of identification data of the
DNA sequence data 110, 120, 130, in addition to the vector of fragments whose
elements encode exactly the size in base-pairs. The j-th element of the sequence map
fragment vector is indicative of a number $a_j$, which corresponds to the size of the
fragment. As an example, the total length of the ordered restriction sequence map
fragment vector can be M. Thus, the fragment vector can be indexed from 0 to M-1.

Overall Process Description

Figure 4 shows an exemplary flow chart of the embodiment of the process
according to the present invention for simulating a restriction digestion of the
sequence map, and then validating the accuracy of the consensus optical ordered
restriction map and/or the simulated ordered map. This process can be performed by
the processing device 10 which is shown in Figures 1 and 2. As shown in this flow
chart, the processing device 10 receives the optical ordered restriction data in step
310, which can be the consensus optical map(s) 260 shown in Figure 3. Then, in step
320, the processing device 10 receives the DNA sequence data, which is preferably
the DNA sequence 210 which is also shown in Figure 3. In step 330, the restriction
digestion of the sequence data is simulated to obtain the simulated (in silico) ordered
restriction map which is also shown in Figure 3 as the sequence map(s) 230.
Thereafter, in step 340, the accuracy of the optical ordered restriction map and/or of
the simulated ordered restriction map is validated, preferably to locate likely matches
within one another. Finally, the results of the validation are generated in step 350.

Exemplary Embodiment of Validation Procedure of the Exemplary Process

Figure 5A shows a detailed flow chart of an embodiment of the exemplary
validation procedure utilized in step 340 of the process shown in Figure 4. In
particular, a current fragment of the optical ordered restriction map is compared to a
respective fragment of the simulated ordered restriction map to obtain one or more
set(s) of most likely matches (step 3410). Then, the processing device 10 determines
if all fragments of the simulated ordered restriction map were checked in step 3420. If
not, the process takes the next fragment of the simulated ordered restriction map to be
the current fragment for checking in step 3430, and the comparison of step
3410 is repeated for the current fragment of the simulated ordered restriction
map. Otherwise, because it is determined that all fragments of the simulated ordered
restriction map were checked, all of the matches are ranked in step 3440, and the
processing device 10 determines the best match(es) in step 3450. If the processing
device 10 determines that the rank of the best match(es) is greater than a predetermined
threshold (step 3460), the processing device 10 validates the accuracy of the optical
ordered restriction map and/or of the simulated ordered restriction map (step 3470).
Otherwise such accuracy is not validated in step 3480. It should be understood that
the exemplary validation procedure shown in Figure 5A can be performed for one or
multiple iterations over the fragments.

Figure 5B shows a detailed illustration of an exemplary flow diagram of steps
3410-3430 of Figure 5A in which the fragments of the optical ordered restriction map
are compared to the fragments of the simulated ordered restriction map to obtain one or
more set(s) of most likely matches. Particularly, in step 4010, the probability
$\Pr(D \mid H(\sigma, p_c, p_f))$, as shall be described in further detail below, is calculated for each
possible alignment of the fragments of the optical ordered restriction map (i.e., the
consensus map) against fragments of the simulated ordered restriction map (i.e., the
sequence map). Then, in step 4020, an overall match probability as a maximum
likelihood estimate ("MLE") is calculated by extending the computation over all
fragments of the consensus map and all fragments of the sequence map.

The exemplary applications of the exemplary embodiment of the process
according to the present invention on the sequence and consensus maps are provided
in further detail below with reference to Figures 6A-6G.

Statistical Description of the Problem

Figure 6A shows an exemplary setup of the matching procedure involving a
sequence map (corresponding to the simulated ordered restriction map) and a
consensus map (corresponding to the optical ordered restriction map). The sequence
map is preferably considered to be an ideal map, i.e., viewed as the hypothesis H of a
Bayesian problem to be analyzed, while the consensus map is preferably considered
to be the data D to be validated against hypothesis H. In this manner the following
probability density function is formed:

$$\Pr(D \mid H(\sigma, p_c, p_f)),$$

where $\sigma$ is a standard deviation which summarizes map-wide standard deviation
data (e.g., $\sigma = f(\sigma_i)$ for some function $f$), $p_c$ is the cut probability, and $p_f$ is the false
positive cut probability. This calculation is shown in Figure 5B and discussed above.

Ideal Scenario

In an ideal scenario, the orientations of the sequence maps are known, there
are no false cuts, and no missing cuts, i.e., $p_c = 1$ and $p_f = 0$; thus the terms associated
with these parameters vanish, as shall be described in further detail below. For
example, if a position h in the consensus map is taken, the consensus map fragment
sub-vector is provided from the position h to N-1. Also, the full fragment vector of
the sequence map can be, e.g., from 0 to M-1. For the sake of simplicity of the
explanation of the present invention, it is possible to remove the h position term of the
consensus map fragment sub-vector, and count the consensus map fragments from the
position term 0 so that expressions such as $l_i$ instead of $l_{h+i}$ can be utilized.

To obtain a "match" between the i-th fragments of the consensus map and the
corresponding fragments of the sequence map, it is preferable to evaluate to what
extent the consensus map and the sequence map deviate from one another. A
Gaussian distribution should preferably be utilized for the i-th fragment of each of the
maps, and the following expression may be evaluated:

$$\frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(l_i - a_i)^2}{2\sigma_i^2}}$$

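Given the above expression, and with the assumption that the sequence map is
correct (i.e., Pr(H) = 1), the overall $\Pr(D \mid H(\sigma, \ldots))$ function can be provided as:

$$\Pr(D \mid H(\sigma, \ldots)) = \prod_{i=0} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(l_i - a_i)^2}{2\sigma_i^2}}$$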

To maximize the likelihood of the validation, it is preferable to utilize the
logarithm of the simplified expression and obtain the following expression:

$$\ln \Pr(D \mid H(\sigma, \ldots)) = \sum_{i=0} \ln\frac{1}{\sqrt{2\pi}\,\sigma_i} - \sum_{i=0} \frac{(l_i - a_i)^2}{2\sigma_i^2}$$

Maximizing this expression maximizes the logarithmic likelihood; it therefore provides
a Maximum Likelihood Estimate ("MLE").

Since it is possible to assume that the first term of the MLE does not vary
extensively from one location to another, it is preferable to simplify the problem by
minimizing a "weighted sum-of-error-square" cost function:

$$F(D, \ldots) = \sum_{i=0} \frac{(l_i - a_i)^2}{2\sigma_i^2}$$

Minimizing the function $F(D, \ldots)$ may yield the "best match" of the sequence map
(represented as H) against the consensus map (represented as D).
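
To make the weighted sum-of-error-square cost concrete, a minimal Python sketch follows. It assumes the ideal scenario (known orientation, no missing cuts and no false cuts), represents the consensus map as a list of (size, standard deviation) pairs and the sequence map as a list of sizes, and tries every starting position h of the consensus map. The function names and the toy data are illustrative assumptions only.

```python
def sum_of_squares_cost(consensus, sequence, h=0):
    """Weighted sum-of-error-square cost of placing `sequence` at position h.

    consensus -- list of (l_i, sigma_i) pairs from the consensus map
    sequence  -- list of fragment sizes a_i from the sequence map
    """
    window = consensus[h:h + len(sequence)]
    return sum((l - a) ** 2 / (2.0 * sigma ** 2)
               for (l, sigma), a in zip(window, sequence))


def best_position(consensus, sequence):
    """Return the starting position h with the smallest cost.

    Assumes the sequence map is no longer than the consensus map.
    """
    positions = range(len(consensus) - len(sequence) + 1)
    return min(positions,
               key=lambda h: sum_of_squares_cost(consensus, sequence, h))


consensus_map = [(10.0, 1.0), (20.0, 2.0), (31.0, 1.0), (40.0, 1.0)]
sequence_map = [20.0, 30.0]
h_best = best_position(consensus_map, sequence_map)
print(h_best, sum_of_squares_cost(consensus_map, sequence_map, h_best))
```

In this toy example the sequence map fits best at the second consensus fragment, where the only contribution to the cost is the 1 Kb disagreement on the second matched fragment.
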

According to the present invention, it is preferable to take into account the two 
possible orientations of the sequence map with respect to the consensus map. Below, 
false cuts and missing cuts in the consensus map are considered. 

Orientation

Since the sequence map can be evaluated against the consensus map by
"reversing" its orientation, the expression for $\Pr(D \mid H(\ldots))$ can be rewritten as:

$$\Pr(D \mid H(\ldots)) = \max\left[\Pr\nolimits_1(D \mid H(\ldots)),\ \Pr\nolimits_2(D \mid H^R(\ldots))\right],$$

where $H^R$ represents the reversed sequence map. As provided previously, it is
possible to construct the function F as:

$$F(D, H) = \max\left[F_1(D, H),\ F_2(D, H^R)\right]$$

Thus, the expression for $F_2(D, H^R)$ will be as follows:

$$F_2(D, H^R) = \sum_{i=0} \frac{(l_i - a_i^R)^2}{2\sigma_i^2},$$

where $a_i^R$ denotes the i-th fragment of the reversed sequence map.

False Cuts and Missing Cuts

In order to correctly model errors in the matching process, it is preferable to
take into account false cuts and missing cuts. For example, the matching process can
be modeled with two parameters:

• Missing restriction sites in the sequence map are preferably modeled
by a probability $p_c$ (i.e., a "cut" probability). In particular, $p_c = 1$ means
that the restriction sites are actually present in the map, $0 < p_c < 1$
means that there are some missing cuts, etc.

• False restriction sites in the consensus map are preferably modeled by
a rate parameter $p_f$ (i.e., a "false" cut probability). In an exemplary
case, $0 < p_f < 1$ means that the consensus map may have some false
cuts.

These parameters should preferably be included in the expression describing Pr(. . .)
and, therefore, in the function F(. . .) described above.



Example 1: No missing cuts and no false cuts. In this example, as shown in Figure
6B, the term for the matching of the i-th fragment of the sequence map 610 against
the i-th fragment of the consensus map 620 should preferably take into account the cut
probability $p_c$. Thus, the expression is as follows:

$$p_c \times \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(l_i - a_i)^2}{2\sigma_i^2}}$$

which yields the cost function, after taking the negative log likelihood:

$$\ln\frac{\sqrt{2\pi}\,\sigma_i}{p_c} + \frac{(l_i - a_i)^2}{2\sigma_i^2}$$

Example 2: Missing cuts and no false cuts. In this example, as shown in Figure
6C, the exemplary embodiment of the system and method of the present invention
considers a cut in the sequence map 630 that has no corresponding cut in the
consensus map 610. A match is attempted of the i-th consensus map fragment against
the aggregation of the j-th and (j-1)-th fragments in the sequence map 630. For example,
the computation of the Gaussian expression should be "penalized" by taking into
account the missing cut. The main term is preferably modeled as:

$$p_c \times (1 - p_c) \times \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(l_i - (a_j + a_{j-1}))^2}{2\sigma_i^2}}$$

yielding a cost function:

$$\ln\frac{\sqrt{2\pi}\,\sigma_i}{p_c\,(1 - p_c)} + \frac{(l_i - (a_j + a_{j-1}))^2}{2\sigma_i^2}$$




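Example 3: No missing cuts and some false cuts. In this case, as shown in
Figure 6D, the converse of Example 2 is considered. A false cut event of
the consensus map 640 can be modeled as a Bernoulli trial with probability $p_f$. For
example, the full term for such matching would likely aggregate fragments i and i-1
of the consensus map 640 against the j-th fragment of the sequence map 620. The full
term would likely be:

$$p_c \times p_f \times \frac{1}{\sqrt{2\pi(\sigma_i^2 + \sigma_{i-1}^2)}}\, e^{-\frac{((l_i + l_{i-1}) - a_j)^2}{2(\sigma_i^2 + \sigma_{i-1}^2)}}$$

Taking the negative log likelihood again, the following expression is obtained:

$$-\ln p_c + \ln\sqrt{2\pi(\sigma_i^2 + \sigma_{i-1}^2)} + \frac{((l_i + l_{i-1}) - a_j)^2}{2(\sigma_i^2 + \sigma_{i-1}^2)} + \ln\frac{1}{p_f}$$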

It should be noted that for the current data obtained from the optical mapping process, 



20 pf 10. This current data often dominate the complete expression, 

Example 4: Some missing cuts and some false cuts. Of course, it is conceivable
that there may be missing cuts and false cuts together, as shown in Figure 6E. It is
possible to accurately match or align the (i - u)-th cut in the sequence map 660 against the
(j - v)-th cut in the consensus map 650. It is also possible to properly match the (i + 1)-th
cut (the cut immediately following the j-th fragment in both the consensus map 650
and the sequence map 660) in the two maps by appropriately treating all the
intervening missing cuts in the sequence map 660 and all the intervening false cuts in the
consensus map 650. In this case, the "matching term" has the following general form:

$$p_c \times p_f^{\,u-1} \times (1 - p_c)^{v-1} \times \frac{1}{\sqrt{2\pi(\sigma_i^2 + \sigma_{i-1}^2 + \cdots + \sigma_{i-u+1}^2)}}\, e^{-\frac{((l_i + l_{i-1} + \cdots + l_{i-u+1}) - (a_j + a_{j-1} + \cdots + a_{j-v+1}))^2}{2(\sigma_i^2 + \sigma_{i-1}^2 + \cdots + \sigma_{i-u+1}^2)}}$$

Taking the negative log likelihood, the following expression is obtained:

$$-\ln p_c + \ln\sqrt{2\pi(\sigma_i^2 + \sigma_{i-1}^2 + \cdots + \sigma_{i-u+1}^2)} + \frac{((l_i + l_{i-1} + \cdots + l_{i-u+1}) - (a_j + a_{j-1} + \cdots + a_{j-v+1}))^2}{2(\sigma_i^2 + \sigma_{i-1}^2 + \cdots + \sigma_{i-u+1}^2)} + (u-1)\ln\frac{1}{p_f} + (v-1)\ln\frac{1}{1 - p_c}$$
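
The general negative log likelihood term can be sketched in Python as follows. This is an illustration under stated assumptions rather than the described implementation: the function name and argument layout are hypothetical, u consensus fragments (sizes and standard deviations) are aggregated against v sequence fragments, each of the u - 1 interior consensus cuts is charged as a false cut, and each of the v - 1 interior sequence cuts is charged as a missing cut.

```python
import math


def matching_cost(l_sizes, sigmas, a_sizes, p_c, p_f):
    """Negative log likelihood of matching u consensus vs. v sequence fragments.

    l_sizes, sigmas -- sizes and standard deviations of the u consensus fragments
    a_sizes         -- sizes of the v sequence map fragments
    p_c, p_f        -- cut probability and false cut probability (0 < p < 1)
    """
    u, v = len(l_sizes), len(a_sizes)
    variance = sum(s * s for s in sigmas)      # summed variances
    error = sum(l_sizes) - sum(a_sizes)        # aggregated sizing error
    return (-math.log(p_c)
            + 0.5 * math.log(2.0 * math.pi * variance)
            + error ** 2 / (2.0 * variance)
            + (u - 1) * math.log(1.0 / p_f)          # false cuts
            + (v - 1) * math.log(1.0 / (1.0 - p_c)))  # missing cuts


# Examples 1 to 3 above as special cases (toy numbers).
one_to_one = matching_cost([20.0], [1.0], [20.0], p_c=0.9, p_f=0.01)
missing_cut = matching_cost([20.0], [1.0], [12.0, 8.0], p_c=0.9, p_f=0.01)
false_cut = matching_cost([12.0, 8.0], [1.0, 1.0], [20.0], p_c=0.9, p_f=0.01)
print(one_to_one, missing_cut, false_cut)
```

With u = v = 1 the function reduces to the cost of Example 1; when the aggregated sizes agree exactly, a missing cut adds only its penalty ln(1/(1 - p_c)) to that cost.
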



B. DYNAMIC PROGRAMMING PROCEDURE

The validation of a sequence map against the optical map can be implemented
as a dynamic programming procedure ("DPP"). Detailed descriptions of the DPP are
provided in T. H. Cormen et al., "Introduction to Algorithms", The MIT Press and
McGraw-Hill, 1990, and D. Gusfield, "Algorithms on Strings, Trees, and Sequences",
Cambridge University Press, 1997, the entire disclosures of which are incorporated
herein by reference. An exemplary DPP for the process according to the present
invention is as follows:

Procedure sequence-map-validate (sequence-map, consensus-map)
/* Other parameters will be specified . . . e.g., p_f, p_c, k, etc. */
begin
    run DPP on consensus-map and sequence map;
    run DPP on consensus-map and reversed sequence map;
    collect the k "best" alignments by examining the last row of both DPP
    tables and "return" them;
end


This DPP procedure can be executed two or more times. It is improbable for 
two alignments for the sequence map and for its reversed version to have equivalent 
scores. It is preferable to start from the DPP's main recurrence to obtain a formulation 
of the sequence map vs. consensus map matching expression. 






Dynamic Programming "Main" Recurrence

For the description provided below, the index i shall be used to indicate a
fragment in the consensus map, and the index j to indicate a fragment in the sequence
map. Assuming that the consensus map has M fragments and that the sequence map
has N fragments, the DPP may preferably utilize an N×M matching table T.
Considering the entry T[i, j], this entry will likely contain the partially computed
value of the matching function F(. . .). For example, F(. . .) would be incrementally
computed from "left" to "right" by taking into consideration all possible fragment by
fragment matches.

The main recurrence for entry T[i, j] is provided as follows:

$$T[i, j] = \min_{\substack{0 < u \le i \\ 0 < v \le j}} \left\{ T[i-u, j-v] - \ln p_c + \ln\sqrt{2\pi(\sigma_i^2 + \sigma_{i-1}^2 + \cdots + \sigma_{i-u+1}^2)} + \frac{((l_i + l_{i-1} + \cdots + l_{i-u+1}) - (a_j + a_{j-1} + \cdots + a_{j-v+1}))^2}{2(\sigma_i^2 + \sigma_{i-1}^2 + \cdots + \sigma_{i-u+1}^2)} + (u-1)\ln\frac{1}{p_f} + (v-1)\ln\frac{1}{1 - p_c} \right\}$$



The respective sizes of u and v should be determined. In
one exemplary embodiment of the present invention, the sizes of u and v should
preferably depend on the values of $\sigma_i$. In another exemplary embodiment of the present
invention, u and v may also depend on the digestion rate of the "in vivo"
experiment that breaks up the DNA molecule. However, a pragmatic bound may be
equal to, e.g., 3 times the overall standard deviation (which in practice can be
approximated by the value 3). This bound may preferably become a parameter of the
DPP. In this way, the computation for each entry T[i, j] should consider
approximately 9 neighboring or adjacent entries.

A simple model for the initial conditions should preferably be as follows:

$$T[i, 0] = 0, \quad \text{for } i \in [1, M]$$

In this model, it is preferable never to match, or to strongly penalize a match of, the first
fragments of the consensus map against an "inner" fragment of the sequence map (cf.
first column having an ∞ value). Also, the match of any fragment of the consensus
map against the first fragment of the sequence map can be made rather neutral (with
the first two zero values). A more complex model initializes the first row of the
dynamic programming table by taking into account, e.g., only the size of the i-th
fragment. Provided below is an exemplary description of a complete model for the
above-referenced boundary conditions.
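
The main recurrence with a simple boundary model can be sketched in Python as shown below. This is a hedged illustration, not the described implementation. It assumes that the table rows index consensus fragments and the columns index sequence fragments, that the match may start at any consensus position at no cost (T[i][0] = 0), that the whole sequence map must be consumed, and that u and v are limited by a hypothetical `bound` parameter of 3. The left and right end penalty computations are not modeled, and the function name and default probabilities are assumptions.

```python
import math


def validate(consensus, sequence, p_c=0.9, p_f=0.01, bound=3):
    """Align a whole sequence map against part of a consensus map.

    consensus -- list of (size, sigma) pairs; sequence -- list of sizes.
    Returns (best score, number of consensus fragments preceding the match end).
    """
    n, m = len(consensus), len(sequence)
    # T[i][j]: best cost of matching the first j sequence fragments such that
    # the match ends right after consensus fragment i - 1.
    T = [[math.inf] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        T[i][0] = 0.0  # assumption: free start anywhere in the consensus map

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for u in range(1, min(bound, i) + 1):
                block = consensus[i - u:i]
                l_sum = sum(l for l, _ in block)
                variance = sum(s * s for _, s in block)
                for v in range(1, min(bound, j) + 1):
                    a_sum = sum(sequence[j - v:j])
                    cost = (T[i - u][j - v]
                            - math.log(p_c)
                            + 0.5 * math.log(2.0 * math.pi * variance)
                            + (l_sum - a_sum) ** 2 / (2.0 * variance)
                            + (u - 1) * math.log(1.0 / p_f)
                            + (v - 1) * math.log(1.0 / (1.0 - p_c)))
                    T[i][j] = min(T[i][j], cost)

    # Best end position among the entries that consumed every sequence fragment.
    end = min(range(n + 1), key=lambda i: T[i][m])
    return T[end][m], end


consensus_map = [(10.0, 1.0), (20.0, 1.0), (30.0, 1.0), (40.0, 1.0)]
score, end = validate(consensus_map, [20.0, 30.0])
print(score, end)
```

Calling the same function on the reversed fragment list (`sequence[::-1]`) covers the second orientation; a traceback table would be needed to recover the alignment itself.
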



Left and Right End Fragment Computations

It is possible to provide a more sophisticated and accurate model for the left
fragments and right fragments calculations (i.e., for the initial and final conditions).
Such models take into consideration the case in which certain fragments on either the
left or the right of the sequence map do not "properly match" any fragment in the
consensus map.



I. Left End Penalty Computation

As shown in Figure 6F, the first "matching fragments" are $a_2$ from the
sequence map 680, and $l_j$ from the consensus map 670, identified by their size. The
general case is for fragment i of the sequence map 680 to match fragment j of the
consensus map 670.

An analysis of the fragment $a_0$ of the sequence map 680 is as follows. Most of
the time, the left end of this fragment $a_0$ (which can be assumed not to correspond to
an actual restriction site) will fall within the boundaries of fragment i-n of the
consensus map 670 (for $0 \le n \le i$).

Within this framework, the minimum value that can be assigned to a "match"
of the left end fragments of the sequence map 680 corresponds to one of three cases:

• Match by extension of the first left end fragment of the sequence map
680.

• Bad matches until fragment i of the sequence map matches fragment j
of the consensus map 670.

• Match without extension to some fragment in the consensus map 670.

Example 1: Extending $a_0$ by x leads to a match. If $a_0$ is "extended" by an extra size
x (as shown in Figure 6F), x is extended as far to the left as possible to match the cut
on the left of fragment i - n (e.g., the fragment of the size illustrated in Figure 6F).

The value of this match (which is built on top of the derivation performed for
the "regular case") is provided by the following expression:

$$\ln\sqrt{2\pi(\sigma_{i-n}^2 + \sigma_{i-n+1}^2 + \cdots + \sigma_i^2)} + \frac{((l_{i-n} + l_{i-n+1} + \cdots + l_i) - (a_0 + a_1 + \cdots + a_j + x))^2}{2(\sigma_{i-n}^2 + \sigma_{i-n+1}^2 + \cdots + \sigma_i^2)} + \frac{x}{L} + n\ln\frac{1}{p_f} + j\ln\frac{1}{1 - p_c}$$

This expression depends on two parameters which did not appear in the
regular case:

x, being the size extension (note that it appears in the second and the third
terms), and

L, being the molecule map average fragment size.

The second sub-term is preferably the regular "sizing error" penalty which
takes into account the extension x. The third sub-term may add an extra penalty based
on the amount of the end fragment being stretched with respect to the overall structure
of the expression. To utilize the expression, it is beneficial to find its minimum
with respect to x. By differentiating with respect to x, the expression can
be minimized by setting x as follows:

$$x = (l_{i-n} + l_{i-n+1} + \cdots + l_i) - (a_0 + a_1 + \cdots + a_j) - \frac{\sigma_{i-n}^2 + \sigma_{i-n+1}^2 + \cdots + \sigma_i^2}{L}$$

By substituting this value for x in the original expression, the following expression is
obtained:

$$\ln\sqrt{2\pi(\sigma_{i-n}^2 + \sigma_{i-n+1}^2 + \cdots + \sigma_i^2)} + \frac{(l_{i-n} + l_{i-n+1} + \cdots + l_i) - (a_0 + a_1 + \cdots + a_j)}{L} - \frac{\sigma_{i-n}^2 + \sigma_{i-n+1}^2 + \cdots + \sigma_i^2}{2L^2} + n\ln\frac{1}{p_f} + j\ln\frac{1}{1 - p_c}$$

Again, the last two sub-terms may account for the false cuts and the missing cuts,
respectively. It is possible to assume that there is at least one "good" cut in the
sequence map.

Example 2: No extension and bad matches until i and j. In this case, the first
"good match" is located when fragment i of the sequence map matches fragment j of
the consensus map. The expression corresponding to this case is

This expression takes into consideration (and possibly corrects) all missing matches
and the false matches in both maps (e.g., the j + 1 term takes into account the 0-th cut
as a missing cut).






Example 3: Match without extension to some fragment in the consensus map. It
shall be assumed that a "good match" exists between fragment i of the consensus map
and fragment j of the sequence map, and, as with Example 1 of this subsection, the
fragment from the consensus map within which the end of fragment 0 (of size
$a_0$) of the sequence map lies is indexed i - n.

A match of the fragment 0 of the sequence map to any of the n fragments up to
fragment i of the consensus map is then attempted. All possible missing cuts and false
cuts along the way are taken into consideration. The attempt at minimizing the
following expression (dependent on k) will likely compete against the expressions in
Examples 1 and 2 for the best end match.



N) +ff WM)) 



+(fc-l)ln— 
Pf 

+ yln- 

II. Right End Penalty Computation

Figure 6G shows an exemplary illustration of the maps which are utilized for
the right end penalty computation, i.e., for fragments trailing the end of the sequence
map 690 and/or of the consensus map 680. This computation is almost symmetric to
the left end penalty computation described above.

However, there is a difference to be taken into account for the right end
computation which makes the computation asymmetrical with respect to the left end
penalty computation described above. When the "last good match" between fragment
i of the consensus map 670 and fragment j of the sequence map 690 is considered,
the score of the match up to that point should also be taken into account.
In particular, the value T[j, i] should be considered (and is thus assumed to be
available at that point).

Thus, as per the left end computation, three terms should be considered. They
are analogous to the three terms for the left end computation, but they should be
augmented with T[j, i] to be meaningful.

III. Description of the Exemplary Validation Procedure

Figure 7 shows a detailed illustration of the exemplary flow diagram and
architecture of the validation procedure according to the present invention which
utilizes dynamic programming principles and the sequence and consensus maps
illustrated in Figures 6F and 6G. Each box represents the solution of a "dynamic
programming"-like problem. In particular, the map data is provided to a left end table
360 which then passes at least a portion of such data to a middle table 365. The outputs
of both the left end table 360 and the middle table 365 are combined in block 370, and
the combined results are forwarded to a results table I 375. Then, at least a portion of
the data from the results table I 375 is passed to a right end table 380, and the
combined results are forwarded to a results table II 385. The data in the results table I
375 and the results table II 385 are computed using the scores contained in the other
tables (e.g., the left end table 360, the middle table 365 and the right end table 380).
The overall computation uses these three tables 360, 365, 380 as follows:

• the T[., .] for the middle table computation;

• the TL[., .] for the left end penalty computation; and

• the TR[., .] for the right end penalty computation.

It is also possible to re-use certain tables to save memory and system resources of the
processing device 10. The flow of control produces the content of each table 360,
365, 380, in turn, and the final resulting table (e.g., the results table II 385) can be
examined to reconstruct the alignment trace-back.

IV. Possible Optimization

Filling the entire T[., .] table, i.e., the middle table 365, may take on the order
of 4 times O^Mnm^N,^) to complete, where Nis the size of the sequence map 
and M is the size of the consensus map. However, it is possible to optimize the filling 
of the middle table 365 down to O(NM min(N, M)) by utilizing the limiting argument
on the computation performed for each entry T[i, j]. Because of the limit on u and v,
the computation time for each entry can be considered "constant".

In a simple setup, the middle table 365 may take up O(NM) space; hence it too
may be quadratic even when extra "backtrace recording" is considered, as described
in Gusfield, D., "Algorithms on Strings, Trees, and Sequences", Cambridge
University Press, 1997.

It is also possible to optimize the execution time via a hashing scheme
similar to the scheme used in the "gentig" program. In such case, the time
complexity can be reduced by a further order of magnitude.

Experimental Results

The first experiments using software based on the system and method
described above checked "in silico" maps obtained from Plasmodium falciparum
sequence data against optical ordered restriction maps for the same organism.

I. Plasmodium falciparum Sequence Data

The sequence for the 14 chromosomes of Plasmodium falciparum was obtained
from the Sanger Institute database (www.sanger.ac.uk) and from the TIGR database
(www.tigr.org). The experiment cut the sequences "in silico" using the BamHI
restriction enzyme. The resulting maps were fed to the software (implementing the
process according to the present invention) along with appropriate optical ordered
restriction maps.

The results of the experiments on chromosome 2 and chromosome 3 (showing
a number of fragments) are provided below, as well as the experiment on all
chromosomes using a particular enzyme (e.g., NheI).

Chromosome    Number of Fragments
              from DB    reversed
chr2          30         23
chr3          36         28


Two "in silico" maps were provided for the chromosome 2 and chromosome 3
sequences with the fragment numbers obtained being provided in the table above.
The molecule maps thus produced were then sent to the validation checker alongside
various consensus maps.

II. Plasmodium falciparum Optical Ordered Restriction

An optical ordered restriction map published in J. Jing et al., "Optical
Mapping of Plasmodium Falciparum Chromosome 2", Genome Research, 9:175-181,
1999 and Z. Lai et al., "A shotgun optical map of the entire Plasmodium Falciparum
genome", Nature Genetics, 23:309-313, 1999, and the maps generated by the "gentig"
program were utilized for this experiment. The "gentig" program provided an
indication of the overall standard deviation to be used for each fragment of the
consensus map. The parameter used was:

$$\sigma = 4.4754\ \text{Kbps},$$

where l is the fragment size and L is the average consensus map fragment size.

III. Validation Procedure Results

The validation DPP according to the present invention was executed on
chromosome 2 and chromosome 3. The DPP ran with the following limitations:

• The u and v parameters for the main recurrence formula were set to 3.

• The procedure for matching the left and right ends of the sequence maps
using the special computations described above was not utilized.

A summary of the results is provided below in Tables 1-3. Tables 1 and 3
show the match of the sequence maps for chromosomes 2 and 3 against the consensus
maps generated by the "gentig" program. Table 2 shows the match of the sequence maps
against the consensus map which was published in M. J. Gardner et al., "Chromosome 2
sequence of the human malaria parasite Plasmodium Falciparum", Science, 282:1126-
1132, 1998. The positions of the matches of the sequence against the consensus maps
are also shown in Tables 1-3.



Table 1 - Chromosome 2 Validation Summary A

rank    matches    score      map id    # missing cuts    # false cuts
1       29         80.869     1302      0                 1
2       28         105.861    1302      2                 1
3       18         126.956    1326      12                4
4       22         127.488    1305      8                 4
5       18         132.890    1414      12                2


In particular, Table 1 shows the data for the best "matches" found by the
validation procedure of the present invention for the case of Plasmodium falciparum
chromosome 2. The "in silico" sequence map was obtained from the TIGR database
sequence. The sequence map (as well as its reverse) was checked against 75
(optical) consensus maps produced by the gentig program. The 75 optical maps cover
the entire Plasmodium falciparum genome. The validation procedure located its best
matches against the map tagged 1302.



Table 2 - Chromosome 2 Validation Summary B

rank    matches    score      map id      # missing cuts    # false cuts
1       29         77.308     NYU-WISC    1                 0
2       22         125.088    NYU-WISC    8                 2
3       22         130.866    NYU-WISC    8                 4
4       24         131.475    NYU-WISC    6                 1
5       24         132.838    NYU-WISC    6                 4

Table 2 shows the data for the best "matches" found by the validation
procedure of the present invention for the case of Plasmodium falciparum
chromosome 2. The "in silico" sequence map was obtained from the TIGR database
sequence. The sequence map (as well as its reverse) was checked against the map
published in the M. J. Gardner et al. publication.



Table 3 - Chromosome 3 Validation Summary

rank    matches    score      map id    # missing cuts    # false cuts
1       35         108.360    1365      1                 0
2       32         117.571    1365      4                 1
3       32         119.956    1365      4                 2
4       35         121.786    1296      1                 3
5       31         125.265    1365      5                 1


Table 3 shows the data for the "best" matches found by the validation
procedure of the present invention for the case of Plasmodium falciparum
chromosome 3. The "in silico" sequence map was obtained from the Sanger Institute
database sequence. The sequence map (as well as its reverse) was checked against
75 (optical) consensus maps produced by gentig. The 75 optical maps cover the
entire Plasmodium falciparum genome. The validation procedure located its best
matches against the map tagged 1365.

The processing device 10 of the present invention executed
approximately 75 × 4 = 300 DPP instances in about 5 minutes during the experiment.
Also, during this experiment, the processing device 10 kept track of all the
intermediate results and made them available for interactive inspection after the actual
execution. Also, the sequence, the sequence map, and the consensus maps were
always available for inspection and manipulation.

IV. Conclusion 

The statistical model of an exemplary embodiment of the present invention is
essentially a formulation of a maximum likelihood problem which is solved by
minimizing a weighted sum-of-square-error score. The solution is computed by
constructing a "matching table" using a dynamic programming approach whose
overall complexity is of the order $O(N M \min(N, M))$ (for our non-optimized
solution), where N and M are the lengths of the two maps being compared (the
sequence map and the consensus map). The preliminary results of the experiment
described above illustrate how the process and system of the present invention
can be used in assessing the accuracy of various sequence and map data currently
being published in a variety of formats from many different sources.

B. ALIGNMENT AND REORDERING PROCESS AND SYSTEM 

Overall Alignment Process Flow Diagram

Figure 8 shows an exemplary embodiment of the process for aligning
sequences using optical maps according to the present invention which can also be
executed by the processing device 10 of Figures 1 and 2. In this exemplary
embodiment and similarly to the validation process illustrated in Figure 3, the optical
mapping data 140 is forwarded to a technique 250 (e.g., the "gentig" program) which
constructs one or more consensus maps 260 based on the optical mapping data 140 by
considering the local variations among aligned single molecule maps.

According to this exemplary embodiment of the alignment process of the
present invention, the particular DNA sequence 210 or a portion of such DNA
sequence is provided. Thereafter, the data for this DNA sequence (or a portion
thereof) is forwarded to a technique 220 which simulates a restriction enzyme
digestion process to generate an "in silico" ordered restriction sequence map 230. The
system and process of the present invention may then execute the validation
algorithm 270 which determines the accuracy of the ordered restriction sequence map
230 based on the data provided in the optical consensus map(s) 260. As with the
validation procedure of Figure 3, this result can be output 280 in the form of a
response such as a score (e.g., a rank for each ordered restriction map), a binary
output (e.g., the accuracy validated vs. unvalidated), etc. The exemplary embodiments
of the validation process and system of the present invention have been described in
great detail herein above. Finally, the simulated ordered restriction sequence map(s)
can be aligned against the optical ordered restriction map in block 400. In one
exemplary embodiment of the alignment process of the present invention, for each
simulated ordered restriction map, the best anchoring position of such map is located
on the ordered restriction consensus map (e.g., an optical consensus map). The result
of such location procedure is the generation of the entire set of anchoring positions of
the simulated ordered restriction maps. In one preferred embodiment, the best anchoring
positions are provided first to effectuate the best possible alignment. This can be done
using a one-dimensional Dynamic Programming Procedure. Those having ordinary
skill in the art would clearly understand that it is possible to produce multiple
alignments for the simulated ordered restriction maps due to the many anchoring
positions that may be available. Provided below are further details of the alignment
process and system according to the present invention.

Detailed Flow Diagram of Alignment Process

Figure 9 shows an exemplary flow chart of the embodiment of the process
according to the present invention for simulating a restriction digestion of the
sequence map, validating the accuracy of the consensus optical ordered restriction map
and/or the simulated map, and constructing an alignment therefor. This process can
be performed by the processing device 10 which is shown in Figures 1 and 2.
Similarly to the validation process shown in Figure 4, the processing device 10
receives the optical ordered restriction data in step 410, which can be the consensus
optical map(s) 260 shown in Figure 8. Then, in step 420, the processing device 10
receives the sequence data, which is preferably the DNA sequence data 210 also
shown in Figure 8. In step 430, the restriction digestion of the sequence data is
simulated to obtain the simulated (in silico) ordered restriction map which is also
shown in Figure 8 as the sequence map(s) 230. Thereafter, the optical ordered
restriction map is compared to the simulated ordered restriction map to obtain one or
more sets of most likely matches (step 440). The processing device 10 then
determines if all the simulated ordered restriction maps were checked in step 445. If
not, the process takes the next simulated ordered restriction map to be the current
simulated ordered restriction map to be checked in step 450, and the comparison of
step 440 is repeated for the current simulated ordered restriction map.
Otherwise, since it is determined that all the simulated ordered restriction maps were
checked, all of the matches are ranked in step 460, and the processing device 10
determines the best match(es) for each simulated ordered restriction map based on the
respective ranks in step 470. Then, in step 480, the alignment of the simulated
ordered restriction map is constructed with respect to the optical ordered restriction
maps based on the score of the matches.
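
The simulated digestion of step 430 can be illustrated with a minimal sketch. It
assumes a linear sequence, a single enzyme described by a recognition site, and a
cut offset within that site; the function name `in_silico_digest` and its parameters
are hypothetical and are not taken from the present description.

```python
from typing import List


def in_silico_digest(sequence: str, site: str, cut_offset: int = 0) -> List[int]:
    """Return the ordered fragment lengths (in base pairs) of a linear sequence.

    `site` is the recognition sequence and `cut_offset` is the position of the
    cut inside the site. Both are illustrative parameters (assumptions).
    """
    sequence = sequence.upper()
    site = site.upper()

    # Collect every cut position by scanning for (possibly overlapping) sites.
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)

    # Fragment boundaries are the sequence start, each cut, and the sequence end.
    bounds = [0] + cuts + [len(sequence)]
    # Zero-length fragments (a cut exactly at an end) are dropped.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


if __name__ == "__main__":
    # Ordered restriction map of a toy sequence, expressed as fragment sizes.
    print(in_silico_digest("AAGGATCCTTTTGGATCCA", "GGATCC", cut_offset=1))
```

The resulting list of fragment sizes is the kind of ordered restriction sequence
map that would then be compared against the optical consensus map(s) in step 440.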


Global Alignment 

To reiterate, the validation process and system of the present invention
described above can match an ordered restriction sequence map against an ordered
restriction consensus map. This validation process and system can possibly be
described as a positioning process of the sequence map against the consensus map.
When the positionings of many sequences are taken into consideration, it may be
possible to describe the validation process as a "global" collective alignment against
a particular consensus map. Thus, for the sake of clarity, the output of the procedure
that produces this final result shall be referred to herein below as an alignment.

For example, the result of n "validation experiments" can be identified as n
sets of possible sequence positions along the consensus map. Each of these results can
be denoted as a set $S_i$ (with $0 < i \le n$), with $|S_i| = k$. Each of the k items in
each $S_i$ is a triple $[s_i, x_{(i,j)}, v_{(i,j)}]$, where $s_i$ is a sequence map
identifier, $x_{(i,j)}$ is the j-th alignment of $s_i$ against the consensus map, and
$v_{(i,j)}$ is the sequence alignment score (with $0 < j \le k$) obtained from the
single sequence (map) positioning process. The set containing every $S_i$ (with
$0 < i \le n$) is called S.

An exemplary embodiment of the procedure to perform the matching, ranking
and alignment steps 440-480 using the sequence maps and costs described above is
[...] alignment whose overall cost C can be computed by summing all the costs
eventually selected.

Initially, in step 510, the global cost C is set to infinity. Then, in step 520, the
best matches out of each set $S_i$ of simulated ordered restriction maps (i.e., sequence
maps) against the optical ordered restriction map (i.e., the consensus map) are
selected. The best matches are grouped into a set of triples called $T_s$, and the cost
$v_{(i,j)}$ and the position $x_{(i,j)}$ of each respective sequence $s_i$ are analyzed.
The cost V of this set of triples $T_s$ is then computed using, e.g., a specialized 1D
Dynamic Programming Procedure (step 540), and compared to C. If V is equal to C plus
or minus a tolerance value (step 550), then the set of triples $T_s$ is determined to be
the result of the alignment procedure (step 580). If V is not equal to C plus or minus a
tolerance value, then first C is equated to V, and the triple
$[s_i, x_{(i,j)}, v_{(i,j)}]$ corresponding to the best of the "second best" among the
$S_i$'s is selected (step 570). The triple $[s_i, x_{(i,j^*)}, v_{(i,j^*)}]$ is then
removed from the set of triples $T_s$, and the triple $[s_i, x_{(i,j)}, v_{(i,j)}]$
(with j different from $j^*$) is inserted into the set of triples $T_s$ (step 575). A
new V is then computed from the updated set of triples $T_s$ (step 540).
Provided below is an exemplary map-based alignment algorithm/problem
which can be utilized with the alignment process and system of the present invention.
Let $S = \bigcup_i S_i$. For example, at most one triple from each $S_i$ can be selected
while satisfying the following global conditions/objectives which can possibly be
relaxed:

1. When anchoring two or more selected triples within the alignment $T_s$, two
selected sequences $s_p$ and $s_q$ anchored at their respective $x_{(p,a)}$ and
$x_{(q,b)}$ preferably do not overlap (for suitable p, q, a, and b, and p not equal
to q);

2. $\sum_i I_i \times v_{(i,j)}$ is minimized over each j in the sequences set $S_i$ so
that as many as possible sequence maps $s_i$'s are included in the alignment; and

3. the number of non-selected sequences, $n - \sum_i I_i$, is minimized,

where $I_i$ is an indicator variable assuming a value 1 if the triple from the sequence
$S_i$ is included in the chosen set $T_s$, and 0 otherwise.

It should be understood that the objectives (2) and (3) provided above may
conflict. In particular, the minimum of the objective (2) is achieved when no sequence
is selected, while with the objective (3), it is preferable to choose as many sequences
as possible, irrespective of the score values. This conflict may be resolved by, e.g., a
weighting scheme involving a Lagrangian-like term which linearly combines the two
contradictory objectives.
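
One possible reading of such a linear combination is sketched below. The
multiplier $\lambda$ and the per-sequence choice index $j_i$ are notational
assumptions introduced here for illustration only; the description does not state
this formula explicitly.

```latex
\min_{I,\; j} \;\; \sum_{i=1}^{n} I_i \, v_{(i,j_i)}
  \;+\; \lambda \Bigl( n - \sum_{i=1}^{n} I_i \Bigr),
\qquad I_i \in \{0, 1\}, \quad \lambda \ge 0
```

Under this sketch, each omitted sequence costs $\lambda$, so, apart from the
non-overlap condition (1), including a sequence is worthwhile roughly when its
score is below $\lambda$. Equivalently, one may maximize
$\sum_i I_i (\lambda - v_{(i,j_i)})$.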

It is possible to solve this problem by using various approximation algorithms,
for example, the following two algorithms/procedures:

1. a "Greedy" algorithm/procedure, and

2. a "Dynamic Programming" algorithm/procedure.

During the experimentation of the alignment system and process of the present
invention, the Greedy algorithm/procedure and the Dynamic Programming
algorithm/procedure were utilized with successful results. Provided below is a
detailed description of these algorithms/procedures (1)-(2) of the present invention.

Greedy Algorithm/Procedure

A solution P can be constructed such that each $S_i$ is ordered by value $v_{(i,j)}$.
Then, the best item from each sequence $S_i$ is placed in the partial solution P by
selecting the sequences in the order imposed by each $x_{(i,j)}$. It should be understood
that the final solution P is not guaranteed to be optimal; however, this solution may
provide results which may be acceptable to the implementer of the alignment
procedures.
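
The greedy procedure can be sketched as follows. This is only one plausible
interpretation: the `Placement` record, its `length` field (used to test overlap),
and the rule of skipping a placement that overlaps the previously accepted one are
assumptions made for illustration, and lower scores are assumed to be better.

```python
from typing import Dict, List, NamedTuple


class Placement(NamedTuple):
    seq_id: str      # s_i, the sequence map identifier
    position: float  # x_(i,j), start of the placement on the consensus map
    score: float     # v_(i,j), alignment score (lower assumed better)
    length: float    # span on the consensus map (assumed field)


def greedy_alignment(S: Dict[str, List[Placement]]) -> List[Placement]:
    """Pick the best-scoring placement of each sequence, then sweep by position."""
    # Order each S_i by score and keep only its best item.
    best = [min(cands, key=lambda p: p.score) for cands in S.values() if cands]
    # Visit the selected items in the order imposed by their positions.
    best.sort(key=lambda p: p.position)

    P: List[Placement] = []
    end = float("-inf")
    for p in best:
        # Accept a placement only if it starts after the last accepted one ends.
        if p.position >= end:
            P.append(p)
            end = p.position + p.length
    return P


if __name__ == "__main__":
    S = {
        "a": [Placement("a", 0, 5, 10), Placement("a", 50, 9, 10)],
        "b": [Placement("b", 5, 3, 10)],
        "c": [Placement("c", 20, 7, 5)],
    }
    print([p.seq_id for p in greedy_alignment(S)])
```

In the toy input, sequence "b" is dropped because its best placement overlaps the
already accepted placement of "a", which illustrates why the greedy result is not
guaranteed to be optimal.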

Dynamic Programming Algorithm/Procedure

This algorithm/procedure is based on the traditional dynamic programming
approach. Indeed, the implementation of this algorithm/procedure is straightforward
and space-efficient as provided below. The problem can first be considered for one
exemplary case when k = 1, and an appropriate algorithm can be selected. Next, the
general case when k > 1 can be considered, and good approximation heuristics may be
devised.

(a) Alignment Procedure for Sequence Number k being 1. If the number of
sequences k present in each set $S_i$ of triples is restricted to be 1 (e.g., being the best
score), then the problem yields to a feasible and efficient algorithm. In general, if the
sequence matches uniquely to one map location, then this case should apply. An
exemplary embodiment of the alignment algorithm for the dynamic programming
solution, constructing the solution P, is described below. In particular,

1. Sort all the triples of sequence, position and cost,
$\langle s_i, x_{(i,j)}, v_{(i,j)} \rangle$, in ascending $x_{(i,j)}$ order, and store the
result in a list L. Thereafter, the indices i and j can be assumed to range over the
list L.

2. Construct two vectors C[i] and B[i] ($0 < i \le n$), where each entry C[i] of the
global cost C is defined to be the cost of including $S_i$ in an alignment that already
contains sequences, or a subset thereof, up to $S_j$, and the index j is stored in B[i].

The update rules for C[i] and B[i] preferably search backward in the C vector
for values which minimize the cost function, and set B to "point back" to the chosen
point. For example,

$$C[i] = \max_{0 < j < i} \bigl( C[j] + W(k, i) \bigr) \quad \text{such that } S_i \text{ does not overlap with } S_j.$$

The W(k, i) function takes into consideration the conflicting nature of the objectives
described above. Since it is most likely not possible to optimize both objectives
simultaneously, a weight function can be generated (where a user may supply the
parameter $\lambda$) which would preferably account for both objectives. Two exemplary W
functions are provided below:

$W_1$ takes into account the "span" covered by the selected sequences (where $|S_i|$ is
the size of the sequence). $W_2$ takes into account the number of sequences which were
selected. The parameter $\lambda$ is controlled by the user.
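
A minimal sketch of the k = 1 procedure is given below. The two weight functions
are not legible in the text above, so the forms used here (`lam * length - score`
for a span-based weight and `lam - score` for a count-based weight) are
assumptions, as are the `Placement` record and the function name. The sketch
follows the displayed update rule and maximizes the accumulated weight.

```python
from typing import List, NamedTuple, Tuple


class Placement(NamedTuple):
    seq_id: str      # s_i
    position: float  # x_i, start on the consensus map
    score: float     # v_i, alignment score (lower assumed better)
    length: float    # |S_i|, span on the consensus map (assumed field)


def dp_alignment(triples: List[Placement], lam: float,
                 use_span: bool = True) -> Tuple[float, List[Placement]]:
    """Choose non-overlapping placements that maximize the summed weight."""
    # Step 1: sort by position and store the result in the list L.
    L = sorted(triples, key=lambda p: p.position)
    n = len(L)
    if n == 0:
        return 0.0, []

    def weight(p: Placement) -> float:
        # Hypothetical weights: span-based (W1-like) or count-based (W2-like).
        base = lam * p.length if use_span else lam
        return base - p.score

    # Step 2: C[i] is the best value of a selection ending with L[i];
    # B[i] points back to the previous selected item (-1 if none).
    C = [0.0] * n
    B = [-1] * n
    for i in range(n):
        w = weight(L[i])
        C[i] = w
        for j in range(i):  # search backward over earlier items
            no_overlap = L[j].position + L[j].length <= L[i].position
            if no_overlap and C[j] + w > C[i]:
                C[i] = C[j] + w
                B[i] = j

    # Trace back from the best final entry through the B pointers.
    i = max(range(n), key=lambda t: C[t])
    total = C[i]
    chain: List[Placement] = []
    while i != -1:
        chain.append(L[i])
        i = B[i]
    chain.reverse()
    return total, chain
```

Because the items are sorted by start position, requiring each item to start
after its predecessor ends is enough to keep the whole selection free of overlaps.
The double loop gives quadratic work in the number of placements.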



(b) Alignment Procedure for Sequence Number k > 1. If the sequence number
k > 1, then the procedure may be more complex. Since for each set $S_i$, there may be k
alignments to select from, the complexity involved in a straightforward
generalization of the preceding procedure is conjectured to grow exponentially. It is
possible to use a heuristic procedure/algorithm to produce an acceptable solution in
the case when the sequence number k > 1. The concept of this procedure is to iterate
or repeat the dynamic programming procedure (i.e., the k = 1 case) on an input set that
takes the best possible solutions from each sequence $S_i$ while ignoring the
non-overlapping constraint. This solution can be further improved in the subsequent
iteration by constructing a new input to the DPP procedure (i.e., where k = 1) that
consists of the preceding solution augmented with an element from each sequence $S_i$
excluded in the preceding solution. Because the preceding solution is also a solution
of the new input, the new solution is at least as effective as the solution
previously provided. In each iteration, the basic solution can also be a general (and
possibly suboptimal) solution. Because an item that is removed from consideration
is never again reconsidered, according to a preferred embodiment of the
present invention, there can be only $O(kn)$ iterations, and each iteration involves
$O(n^2)$ work. Hence, a naive analysis yields an $O(kn^3)$ time algorithm.
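
The iteration described above can be sketched as follows. This is one possible
reading rather than a definitive implementation: the `Placement` record, the
helper `best_chain` (a compact k = 1 dynamic program), and the count-based weight
`lam - score` are all assumptions introduced for illustration.

```python
from typing import Dict, List, NamedTuple


class Placement(NamedTuple):
    seq_id: str      # s_i
    position: float  # x_(i,j), start on the consensus map
    score: float     # v_(i,j), lower assumed better
    length: float    # span on the consensus map (assumed field)


def best_chain(items: List[Placement], lam: float) -> List[Placement]:
    """k = 1 dynamic program: best non-overlapping chain under weight lam - score."""
    L = sorted(items, key=lambda p: p.position)
    if not L:
        return []
    C: List[float] = []
    B: List[int] = []
    for i, p in enumerate(L):
        w = lam - p.score  # hypothetical count-based weight
        C.append(w)
        B.append(-1)
        for j in range(i):
            if L[j].position + L[j].length <= p.position and C[j] + w > C[i]:
                C[i], B[i] = C[j] + w, j
    i = max(range(len(L)), key=lambda t: C[t])
    out: List[Placement] = []
    while i != -1:
        out.append(L[i])
        i = B[i]
    return out[::-1]


def iterative_alignment(S: Dict[str, List[Placement]], lam: float) -> List[Placement]:
    """Repeat the k = 1 procedure, feeding in the next candidate of excluded sequences."""
    # Remaining candidates of each sequence, best score first.
    queues = {sid: sorted(c, key=lambda p: p.score) for sid, c in S.items()}
    # First input: the best candidate of every sequence, overlaps ignored.
    pool = [q.pop(0) for q in queues.values() if q]
    solution = best_chain(pool, lam)
    while True:
        chosen = {p.seq_id for p in solution}
        # One new element from each sequence excluded in the preceding solution.
        extra = [q.pop(0) for sid, q in queues.items() if sid not in chosen and q]
        if not extra:
            return solution
        # The preceding solution is part of the new input, so the new result
        # cannot be worse. Popped candidates are never reconsidered.
        solution = best_chain(solution + extra, lam)
```

Every pass of the loop consumes at least one candidate, so the number of passes is
bounded by the total number of candidates, in line with the iteration bound stated
above.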






Experimental Results 

Figure 11 shows an illustration of a possible alignment of an exemplary
chromosome arrangement using the system and method of the present invention. In
particular, a region of the alignment of P. falciparum's Chromosome 12 is shown
therein which was generated using the software implementing an exemplary
embodiment of the validation, alignment and reordering system and method of the
present invention. The two underlined maps in positions 39 and 50 of the figure
illustrate an acceptable anchoring of "contigs" 11 and 13 to the optical ordered
restriction map. Also, the alignment was obtained without any overlap filter.



One having ordinary skill in the art would clearly recognize many other
applications of the embodiments of the system and process for validating and aligning
the simulated ordered restriction maps according to the present invention. Indeed,
the present invention is in no way limited to the exemplary applications and
embodiments thereof described above.

CLAIMS

1. A process for comparing ordered segments of a first DNA map with ordered
segments of a second DNA map to determine a level of accuracy of at least one of the
first DNA map and the second DNA map, comprising the steps of:

a) receiving the first and second DNA maps, the first DNA map corresponding to
a sequence DNA map, the second DNA map corresponding to a genomic
consensus DNA map as provided in an ordered restriction DNA map; and

b) validating the accuracy of at least one of the first DNA map and the second
DNA map based on information associated with the first and second DNA
maps.

2. The process according to claim 1, wherein the validating step comprises
determining whether one or more matches exist between ordered segments of the first
DNA map and the ordered segments of the second DNA map.

3. The process according to claim 2, wherein the validating step further
comprises obtaining a number of the matches which exist between the segments of the
first DNA map and the segments of the second DNA map after the determining
substep.

4. The process according to claim 1, wherein the validating step comprises
determining whether the first DNA map includes one or more cuts which are missing
from the second DNA map.

5. The process according to claim 4, wherein the validating step further
comprises obtaining a number and locations of the missing cuts, after the determining
substep, based on the first and second DNA maps.

6. The process according to claim 1, wherein the validating step comprises
determining whether the second DNA map includes one or more cuts which are
absent from the first DNA map.

7. The process according to claim 6, wherein the validating step further
comprises obtaining a number and locations of the absent cuts, after the determining
substep, based on the first and second DNA maps.

8. The process according to claim 3, wherein the validating step further
comprises the substeps of:

i. determining whether the first DNA map includes one or more cuts
which are missing from the second DNA map,
ii. after substep i, obtaining a first number and locations of the missing
cuts based on the first and second DNA maps,
iii. determining whether the second DNA map includes one or more cuts
[...]

iv. [...]

9. [...]
predetermined threshold, and
iii. the second number of the absent cuts is greater than a second
predetermined threshold.



10. The process according to claim 1, wherein the first DNA map is an in-silico 
ordered restriction map obtained from a DNA sequence. 

11. The process according to claim 10, wherein the first DNA map includes
identification data and at least one vector of the segments of the first DNA map.

12. The process according to claim 11, wherein the at least one vector of the first
segments encodes a size of base-pairs of the DNA sequence.

13. The process according to claim 12, wherein the second DNA map includes
identification data and at least one variable-length vector representing its ordered
segments.

14. The process according to claim 1, wherein the second DNA map is a
subsequence of a genome-wide ordered restriction map of an optical DNA map.

15. The process according to claim 1, wherein the validation step comprises
determining the accuracy of at least one of the first DNA map and the second DNA
map using the following probability density function:

$\Pr(D \mid H(\sigma, p_c, p_f))$

where:

D is the second DNA map,
H is the first DNA map,
$\sigma$ is a standard deviation summarizing map-wide standard deviation data,
$p_c$ is a probability of a positive cut of a DNA sequence, and
$p_f$ is a probability of a false-positive cut of the DNA sequence.

16. The process according to claim 1, wherein the accuracy is validated as a
function of an orientation of the first DNA map with respect to an orientation of the
second DNA map.

17. The process according to claim 1, wherein the validation step comprises the
substeps of:

i. executing a dynamic programming procedure ("DPP") on the first and
second DNA maps to generate a first table of partial and complete
alignment scores, and first auxiliary tables and first data structures to
keep track of number and locations of cuts and segment matches,

ii. receiving a third DNA map which is a reverse map of the first DNA
map,

iii. executing the DPP on the second and third DNA maps to generate a
second table of partial and complete alignment scores, and second
auxiliary tables and second data structures to keep track of number and
locations of the cuts and the segment matches, and

iv. analyzing the last row of the first table and a last row of the second
table to obtain at least one optimum alignment of the first and second
DNA maps, and

v. reconstructing at least one of an optimum alignment and sub-optimal
alignments using the first and second auxiliary tables and data
structures.

18. The process according to claim 1, wherein the accuracy is validated by
matching an extension of a first left end segment of the segments of the first DNA
map to at least one of the segments of the second DNA map.

19. The process according to claim 1, wherein the accuracy is validated by 
matching an extension of a first right end segment of the segments of the first DNA 
map to at least one of the segments of the second DNA map. 

20. The process according to claim 1, further comprising the step of:

d) detecting an alignment of the first DNA map with respect to the second DNA
map, the alignment being indicative of sequence positions of the segments of
the first DNA map along the second DNA map.

21. A software system which, when executed on a processing device, configures
the processing device to compare segments of a first DNA map with segments of a
second DNA map to determine a level of accuracy of at least one or both of the first
DNA map and the second DNA map, the software system comprising:

a processing subsystem which, when executed on the processing device,
configures the processing device to perform the following steps:

a) receives the first and second DNA maps, the first DNA map
corresponding to a sequence DNA map, the second DNA map
corresponding to a genomic consensus DNA map as provided in an
ordered restriction DNA map, and

b) validates the accuracy of at least one of the first DNA map and the
second DNA map based on information associated with the first and
second DNA maps.

22. The software system according to claim 21, wherein, when validating the
accuracy, the processing subsystem determines whether one or more matches exist
between at least one of the segments of the first DNA map and at least one of the
segments of the second DNA map.

23. The software system according to claim 22, wherein, when validating the
accuracy, the processing subsystem obtains a number of the matches which exist
between the segments of the first DNA map and the segments of the second DNA
map.

24. The software system according to claim 21, wherein, when validating the
accuracy, the processing subsystem determines whether the first DNA map includes
one or more cuts which are missing from the second DNA map.

25. The software system according to claim 4, wherein, when validating the
accuracy, the processing subsystem obtains number and location of the missing cuts
based on the first and second DNA maps.

26. The software system according to claim 21, wherein, when validating the
accuracy, the processing subsystem determines whether the second DNA map
includes one or more cuts which are absent from the first DNA map.

27. The software system according to claim 24, wherein, when validating the 
accuracy, the processing subsystem obtains number and location of the absent cuts 
based on the first and second DNA maps. 

28. The software system according to claim 23, wherein, when validating the
accuracy, the processing subsystem:

i. determines whether the first DNA map includes one or more cuts
which are missing from the second DNA map,

ii. obtains number and location of the missing cuts based on the first and
second DNA maps,

iii. determines whether the second DNA map includes one or more cuts which
are absent from the first DNA map, and

iv. obtains a second number of the absent cuts based on the first and
second DNA maps.

29. The software system according to claim 28, wherein, when executed on the 
processing device, the processing subsystem further configures the processing device 
to generate an error indication if at least one of: 

i. the number of the matches is less than a match threshold, 

ii. the first number of the missing cuts is greater than a first 
predetermined threshold, and 

iii. the second number of the absent cuts is greater than a second 
predetermined threshold. 

30. The software system according to claim 21, wherein the first DNA map is an
in-silico ordered restriction map obtained from a DNA sequence.

31. The software system according to claim 30, wherein the first DNA map
includes identification data and a variable-length vector of the segments of the first
DNA map.

32. The software system according to claim 31, wherein the vector of the
segments of the first DNA map encodes a size of base-pairs of the DNA sequence.

33. The software system according to claim 32, wherein the second DNA map
includes identification data and a variable-length vector of the segments of the second
DNA map.

34. The software system according to claim 21, wherein the second DNA map is a 
genome-wide ordered restriction map of an optical DNA map. 

35. The software system according to claim 21, wherein, when validating the
accuracy, the processing subsystem determines the accuracy of the at least one of the
first DNA map and the second DNA map using the following probability density
function:

$\Pr(D \mid H(\sigma, p_c, p_f))$

where:

D is the second DNA map,
H is the first DNA map,
$\sigma$ is a standard deviation summarizing map-wide standard deviation data,
$p_c$ is a probability of a positive cut of a DNA sequence, and
$p_f$ is a probability of a false-positive cut of the DNA sequence.

36. The software system according to claim 21, wherein the accuracy is validated 
as a function of an orientation of the first DNA map with respect to an orientation of 
the second DNA map. 

37. The software system according to claim 21, wherein, when validating the
accuracy, the processing subsystem:

i. executes a dynamic programming procedure ("DPP") on the first and
second DNA maps to generate a first table of partial and complete
alignment scores, and first auxiliary tables and data structures to keep
track of the number and locations of cuts and segment matches,

ii. receives a third DNA map which is a reverse map of the first DNA
map,

iii. executes the DPP on the second and third DNA maps to generate a
second table of partial and complete alignment scores, and second
auxiliary tables and data structures to keep track of the number and
locations of cuts and segment matches,

iv. analyzes a last row of the first table and a last row of the second table
to obtain at least one optimum alignment of the first and second DNA
maps, and

v. reconstructs at least one of an optimum alignment and sub-optimal
alignments using the first and second auxiliary tables and data
structures.



38. The software system according to claim 21, wherein the accuracy is validated
by matching an extension of a first left end segment of the segments of the first DNA
map to at least one of the segments of the second DNA map.

39. The software system according to claim 21, wherein the accuracy is validated
by matching an extension of a first right end segment of the first DNA map to at least
one of the segments of the second DNA map.

40. The software system according to claim 21, wherein, when executed on the
processing device, the processing subsystem further configures the processing device
to determine an alignment of the first DNA map with respect to the second DNA map,
the alignment being indicative of sequence positions of the first segments along the
second DNA map.

41. A process for aligning a plurality of DNA sequences with a DNA map,
comprising the steps of:

a) receiving the DNA sequences and the DNA map, the DNA sequences being
fragments of a genome, the DNA map corresponding to a genomic consensus
DNA map which relates to an ordered restriction DNA map;

b) validating a level of accuracy of at least one of the DNA sequences and the
DNA map based on information associated with the DNA sequences and the
DNA map; and

c) determining locations of the DNA map at which the DNA sequences are
capable of being associated with particular segments of the DNA map.

42. The process according to claim 41, wherein the locations are determined for
each of the DNA sequences.

43. The process according to claim 41, wherein the locations are positions on the 
DNA map at which the corresponding DNA sequences are anchorable. 






44. The process according to claim 41, wherein the locations define at least one 
alignment of the DNA sequences with respect to the DNA map. 



45. The process according to claim 44, wherein the at least one alignment includes
multiple alignments of the DNA sequences with respect to the DNA map.

46. The process according to claim 45, further comprising the step of: 

d) ranking the multiple alignments based on a predetermined criteria to obtain a 
score set which includes a particular score for each of the multiple alignments. 

47. The process according to claim 46, wherein the determining step comprises:

i. providing the DNA sequences in a first order of the multiple
alignments with respect to the DNA map, and

ii. determining a position for each of the DNA sequences, with respect to
the DNA map, by selecting the DNA sequences to be in a second order
corresponding to the score set.

48. The process according to claim 41, wherein the determining step comprises
restricting each of the DNA sequences to be associated with only one of the locations
on the DNA map.

49. The process according to claim 48, wherein the determining step produces a 
single alignment of the DNA sequences with respect to the DNA map. 

50. The process according to claim 41, wherein the determining step includes:

i. locating an optimal one of the locations for each of the DNA sequences
to obtain an alignment solution, and
ii. repeating substep ii for each of the locations.

51. The process according to claim 50, wherein the locating substep is repeated
for each subsequent one of the locations and excluding the alignment solution from a
preceding locating substep.

52. The process according to claim 51, wherein each subsequent locating substep
is performed by relaxing at least one particular constraint to determine the respective
locations.

53. The process according to claim 52, wherein the at least one particular
constraint includes a first requirement that two of the DNA sequences are prevented
from overlapping when associated with the respective locations on the DNA map.
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54. The process according to claim 52, wherein the particular constraint includes 
at least one of: 

i. a second requirement that a maximum number of the DNA sequences
are associated with the respective locations on the DNA map, and
ii. a third requirement that an overall score of the alignment of the DNA
sequences with respect to the locations on the DNA map is minimized
or maximized.

55. The process according to claim 54, further comprising the step of:
e) assigning respective weights to the second requirement and the third
requirement.
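The interplay of the three requirements in claims 53 to 55 can be illustrated with a small exhaustive search. This is only a sketch under assumptions: candidate locations are given as `(start, end, score)` tuples, a higher score is better, the weights `w_count` and `w_score` and the flag `allow_overlap` are hypothetical parameters, and brute-force enumeration is used purely for clarity and would not scale to real data.

```python
from itertools import product


def overlaps(a, b):
    """True if two half-open map intervals (start, end, ...) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def best_placement(candidates, w_count=1.0, w_score=1.0, allow_overlap=False):
    """Pick at most one location per sequence, maximizing a weighted objective.

    candidates: dict mapping a sequence name to a list of (start, end, score).
    Objective (assumed): w_count * number placed + w_score * total score.
    """
    names = sorted(candidates)
    # Each sequence may be left unplaced (None) or take one candidate location.
    options = [[None] + list(candidates[n]) for n in names]
    best, best_value = {}, float("-inf")
    for choice in product(*options):
        placed = [(n, c) for n, c in zip(names, choice) if c is not None]
        if not allow_overlap and any(
            overlaps(p[1], q[1])
            for i, p in enumerate(placed)
            for q in placed[i + 1:]
        ):
            continue  # first requirement: placed sequences must not overlap
        value = w_count * len(placed) + w_score * sum(c[2] for _, c in placed)
        if value > best_value:
            best, best_value = dict(placed), value
    return best, best_value
```

Setting `allow_overlap=True` corresponds to relaxing the first requirement, while changing `w_count` and `w_score` shifts the balance between placing as many sequences as possible and optimizing the overall score.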

56. The process according to claim 41, wherein the ordered restriction DNA map
is an optical DNA map.

57. A software system which, when executed on a processing device, configures 
the processing device to align a plurality of DNA sequences with a DNA map, the 
software system comprising: 

a processing subsystem which, when executed on the processing device, 
configures the processing device to perform the following steps:

a) receives the DNA sequences and the DNA map, the DNA sequences 
being fragments of a genome, the DNA map corresponding to a 
genomic consensus DNA map which relates to an ordered restriction 
DNA map, 

b) validates a level of accuracy of at least one of the DNA sequences and
the DNA map based on information associated with the DNA
sequences and the DNA map, and
c) determines locations of the DNA map at which the DNA sequences are
capable of being associated with particular segments of the DNA map.

58. The software system according to claim 57, wherein the locations are
determined for each of the DNA sequences. 

59. The software system according to claim 57, wherein the locations are positions 
on the DNA map at which the corresponding DNA sequences are anchorable.

60. The software system according to claim 57, wherein the locations define at 
least one alignment of the DNA sequences with respect to the DNA map. 

61. The software system according to claim 60, wherein the at least one alignment
includes multiple alignments of the DNA sequences with respect to the DNA map. 

62. The software system according to claim 61, wherein the processing subsystem, 
when executed on the processing device, configures the processing device to rank the
multiple alignments based on a predetermined criteria to obtain a score set which
includes a particular score for each of the multiple alignments.

63. The software system according to claim 62, wherein the processing device is
configured by the processing subsystem to determine the locations of the DNA
sequences by:

i. providing the DNA sequences in a first order of the multiple
alignments with respect to the DNA map, and

ii. determining a position for each of the DNA sequences, with respect to
the DNA map, by selecting the DNA sequences to be in a second order
corresponding to the score set.

64. The software system according to claim 57, wherein the processing device is 
configured by the processing subsystem to determine the locations on the DNA 
sequences by restricting each of the DNA sequences to be associated with only one of
the locations of the DNA map.

65. The software system according to claim 64, wherein the processing device is
configured by the processing subsystem to determine the locations of the DNA 
sequences for producing a single alignment of the DNA sequences with respect to the 
DNA map.

66. The software system according to claim 57, wherein the processing device is 
configured by the processing subsystem to determine the locations of the DNA 
sequences by:

i. locating an optimal one of the locations for each of the DNA sequences
to obtain an alignment solution, and

ii. repeating substep i for each of the locations.

67. The software system according to claim 66, wherein the locating of the 
optimal one of the locations is repeated for each subsequent one of the locations and
the alignment solution is excluded from a preceding locating iteration.

68. The software system according to claim 67, wherein the processing subsystem 
configures the processing device to perform each subsequent locating iteration by 
relaxing at least one particular constraint to determine the respective locations.

69. The software system according to claim 68, wherein the at least one particular 
constraint includes a first requirement that two of the DNA sequences are prevented 
from overlapping when associated with the respective locations on the DNA map. 

70. The software system according to claim 68, wherein the particular constraint
includes at least one of: 

i. a second requirement that a maximum number of the DNA sequences 
are associated with the respective locations on the DNA map, and 

ii. a third requirement that an overall score of the alignment of the DNA
sequences with respect to the locations on the DNA map is minimized
or maximized.



71. The software system according to claim 70, wherein the processing subsystem,
when executed on the processing device, configures the processing device to assign
respective weights to the second requirement and the third requirement.

72. The software system according to claim 70, wherein the ordered restriction 
DNA map is an optical DNA map. 

73. A process for aligning a plurality of DNA sequences with a DNA map, 
comprising the steps of:

a) receiving the DNA sequences and the DNA map, the DNA sequences being 
fragments of a genome, the DNA map corresponding to a genomic consensus 
DNA map which relates to an ordered restriction DNA map; and 

b) determining locations of the DNA map at which the DNA sequences are
capable of being associated with particular segments of the DNA map, the
locations of the DNA sequences being determined by:

i. locating an optimal one of the locations for each of the DNA 
sequences, and 

ii. repeating substep i for each of the locations.

74. A software system which, when executed on a processing device, configures 
the processing device to align a plurality of DNA sequences with a DNA map, the 
software system comprising: 

a processing subsystem which, when executed on the processing device, 
configures the processing device to perform the following steps:

a) receives the DNA sequences and the DNA map, the DNA sequences 
being fragments of a genome, the DNA map corresponding to a 
genomic consensus DNA map which relates to an ordered restriction 
DNA map, and

b) determines locations of the DNA map at which the DNA sequences are
capable of being associated with particular segments of the DNA map
by:

i. locating an optimal one of the locations for each of the DNA
sequences, and 

ii. repeating the locating of the optimal one of the locations for
each of the locations.